Moving Beyond Bots: MTurk as a Source of High Quality Data

published on October 29, 2018

By Leib Litman, PhD, Jonathan Robinson, PhD, Aaron Moss, PhD, & Richa Gautam

Highlights

  • We collected high quality data on MTurk when using TurkPrime’s IP address- and geocode-restricting tools.

  • Using a novel format for our anchoring manipulation, we found that Turkers are highly attentive, even under taxing conditions.

  • After querying the TurkPrime database, we found that farmer activity has significantly decreased over the last month.

  • When MTurk is used the right way, researchers can be confident they are collecting quality data.

  • We are continuously monitoring and maintaining data quality on MTurk.

  • Starting this month, we will be conducting monthly surveys of data quality on Mechanical Turk.

About a month ago, we published our After the Bot Scare blog post about workers providing poor-quality data on Amazon’s Mechanical Turk. This month, we ran a follow-up study with those “farmers” to assess the effectiveness of the tools we created to deal with the problem. In this blog post, we present data from that follow-up study and evidence suggesting our tools are working.

For those who have not followed the conversation about “bots” on MTurk, our previous blog post provides a complete overview.


Data Quality Month 2: Blocking Farmers

Our investigation was almost identical to the one from last month. We ran two studies on MTurk, one in which we used the list of farmers we identified last time (these were workers who had taken 80% or more of their HITs from server farms), and another in which we collected data from a standard MTurk sample (i.e., workers who have greater than a 95% approval rating and more than 100 HITs completed—commonly used qualifications). For the standard MTurk sample, we used our “Block Duplicate IP Addresses,” “Block Suspicious Locations,” and “Block Duplicate Geolocations” tools and excluded workers who took part in our data quality study last month.

Both HITs paid $2 and were 20 minutes long. We used the same measures (with some minor modifications discussed below), with the same rationale, as last time.
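
For context, the two “commonly used qualifications” above can also be set directly through MTurk’s API. Below is a minimal sketch using boto3; the title, description, question file, and other HIT parameters are illustrative placeholders, and TurkPrime’s IP- and geolocation-blocking tools are platform features not shown here. The two QualificationTypeIds are MTurk’s built-in system qualifications for approval rate and number of approved HITs.

```python
import boto3

# Connect to the MTurk production endpoint (region must be us-east-1).
mturk = boto3.client("mturk", region_name="us-east-1")

# MTurk's built-in system qualifications for the two commonly used
# requirements: >95% approval rating and >100 approved HITs.
qualification_requirements = [
    {
        "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
        "Comparator": "GreaterThan",
        "IntegerValues": [95],
    },
    {
        "QualificationTypeId": "00000000000000000040",  # NumberHITsApproved
        "Comparator": "GreaterThan",
        "IntegerValues": [100],
    },
]

# A $2, ~20-minute survey HIT; title, description, and question file
# are illustrative placeholders, not the study's actual materials.
hit = mturk.create_hit(
    Title="20-minute research survey",
    Description="A survey about personality and decision making.",
    Reward="2.00",
    AssignmentDurationInSeconds=60 * 60,
    LifetimeInSeconds=60 * 60 * 24 * 4,   # open for 4 days, as in the study
    MaxAssignments=100,
    Question=open("survey_question.xml").read(),
    QualificationRequirements=qualification_requirements,
)
print(hit["HIT"]["HITId"])
```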

We easily recruited 100 participants for the non-farmer group, but had trouble recruiting enough farmers into our study. The study was open to 408 farmers. Although we aimed for 100 farmers to complete the study, we were only able to collect 55 responses over 4 days. In order to ascertain the reason for low recruitment, we looked at the activity from server farms and the activity of known farmers on our list over the last month. We found that activity from server farms was in steady decline. Between the first half of August and the first half of September, activity from server farms declined by about 80% (we continued to see this decline for the second part of September and the first part of October). This suggests that many farmers who were previously active on MTurk are no longer active on those accounts.
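
As a rough illustration of this kind of query, here is a sketch of how monthly activity counts and their percent change might be computed from a submission log; the file and column names are hypothetical, not TurkPrime’s actual schema.

```python
import pandas as pd

# Hypothetical activity log: one row per HIT submission from a flagged
# farm account, with a "submitted_at" timestamp column.
log = pd.read_csv("farm_activity.csv", parse_dates=["submitted_at"])

# Count submissions per month, then compute month-over-month change.
monthly = log.set_index("submitted_at").resample("MS").size()
print(monthly.pct_change() * 100)   # e.g., roughly -80% Aug -> Sep
```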


The Survey

We made one small change to the survey. We embedded the Mt. Everest anchoring manipulation within the Big Five Inventory questions, so that in the midst of the personality questions half of the participants were asked whether Mt. Everest is more than 2,000 feet tall and the other half whether it is more than 45,500 feet tall. After that page, participants were asked to enter their estimate of the height of Mt. Everest. We made this change because we wanted to see whether the anchoring effect would persist if more time passed between the questions, and if the anchor was embedded in a matrix full of attention-taxing stimuli.

Results

Big Five Inventory

Like last month, farmers had low Cronbach’s alpha scores, showing low reliability across all five factors, while non-farmers had high alphas (see Table 1).

Table 1: Cronbach’s Alpha coefficients for each factor of the BFI, for farmers and non-farmers

Personality Factor  | Farmers | Non-farmers
Openness            | 0.429   | 0.866
Conscientiousness   | 0.686   | 0.894
Extraversion        | 0.684   | 0.915
Agreeableness       | 0.632   | 0.841
Neuroticism         | 0.729   | 0.905
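
For readers who want to run the same reliability check on their own data, Cronbach’s alpha can be computed directly from a respondents-by-items matrix. A minimal sketch in Python; the item columns are assumed to be numeric and already reverse-coded where needed:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a respondents-by-items DataFrame.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    """
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical usage, one call per BFI factor (assumed column names):
# alpha_openness = cronbach_alpha(bfi[["o1", "o2", "o3", "o4", "o5"]])
```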

Using our Squared Discrepancy Procedure (SDP; Litman, Robinson, & Rosenzweig, 2015), we examined how consistently participants respond to forward- and reverse-coded questions (e.g., “I tend to be organized” vs. “I tend to be disorganized”). The measure yields a score, expressed as a percentage, describing how consistently participants respond to reversed items. In the distribution of SDP scores (Figure 1), around 47% of farmers fall below a score of 75, the clear cutoff for random responding, while only 2% of non-farmers do. This is consistent with our findings from last month.

Figure 1: Consistency of responses to forward- and reverse-coded BFI items (SDP scores). About 47% of farmers fell below the 75-point consistency threshold for random responding, compared with only 2% of non-farmers.
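
The exact scoring rules for the SDP are described in Litman, Robinson, and Rosenzweig (2015). As a rough sketch of the underlying idea only, a consistency score can be built from the squared discrepancies between each forward item and its reverse-coded partner, rescaled so that 100 means perfectly consistent responding. The pair names and 1–5 scale below are assumptions, not the published procedure:

```python
import pandas as pd

def sdp_consistency(df: pd.DataFrame, pairs: list[tuple[str, str]],
                    scale_min: int = 1, scale_max: int = 5) -> pd.Series:
    """Rough per-respondent consistency score from squared discrepancies.

    For each (forward, reversed) item pair, the reversed item is recoded
    onto the forward scale. A perfectly consistent respondent then gives
    identical answers, so the squared discrepancy is 0. Scores are
    rescaled to 0-100, where 100 means perfectly consistent responding.
    """
    max_sq = (scale_max - scale_min) ** 2       # worst discrepancy per pair
    sq_sum = 0
    for fwd, rev in pairs:
        recoded = (scale_max + scale_min) - df[rev]   # reverse-code item
        sq_sum += (df[fwd] - recoded) ** 2
    return 100 * (1 - sq_sum / (max_sq * len(pairs)))

# Hypothetical usage with assumed column names:
# scores = sdp_consistency(bfi, [("organized", "disorganized"),
#                                ("talkative", "quiet")])
# pct_random = (scores < 75).mean() * 100   # percent below the cutoff
```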

Using the BFI as a measure of attentiveness and internal consistency, we see that data obtained using our tools is of very high quality. This trend was repeated across the other measures as well.

Anchoring Task

Our findings in the anchoring task provide compelling evidence that MTurk participants are very attentive. Remember, the anchoring manipulation was embedded in the BFI among over 50 other questions presented in matrix format (see Figure 2).

Figure 2: Survey flow for the anchoring task. The Mt. Everest anchor question was embedded within the Big Five Inventory matrix, and 11 other questions followed before participants were asked to estimate Mt. Everest’s actual height.

Additionally, participants answered 11 questions after seeing the anchor and before providing their estimate of the height of Mt. Everest. This is a demanding scenario in which to investigate anchoring effects. Still, we see a clear anchoring effect for non-farmers and no such effect for farmers (see Figure 3). This shows that MTurk workers are, as they have always been, high-quality participants who pay attention to detail.

Figure 3: Numerical estimates of Mt. Everest’s height. Non-farmers showed the expected anchoring effect (mean estimates of 29,029 vs. 10,088 feet in the high- and low-anchor conditions), while farmers responded similarly in both conditions (13,717 vs. 15,445 feet).
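
Statistically, the anchoring effect is a between-condition comparison of mean estimates. Here is a minimal analysis sketch with hypothetical file and column names; we show a Welch’s t-test for illustration, as the post does not report which test was used:

```python
import pandas as pd
from scipy import stats

# Hypothetical data file with one row per participant and assumed columns:
# "condition" ("low_anchor" / "high_anchor") and "everest_estimate" in feet.
df = pd.read_csv("anchoring_responses.csv")

low = df.loc[df["condition"] == "low_anchor", "everest_estimate"]
high = df.loc[df["condition"] == "high_anchor", "everest_estimate"]

# Welch's t-test (does not assume equal variances across conditions).
t, p = stats.ttest_ind(high, low, equal_var=False)
print(f"low anchor mean = {low.mean():.0f} ft, "
      f"high anchor mean = {high.mean():.0f} ft, t = {t:.2f}, p = {p:.4f}")
```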

For the low anchor, most participants in both groups agreed that Mt. Everest is taller than 2,000 feet (see Figure 4). For the high anchor, however, while most of the non-farmers disagreed with Mt. Everest being taller than 45,500 feet, 66% of farmers agreed (either a little or strongly) with the statement (see Figure 5).

Figure 4: Percent of participants who agreed or disagreed that Mt. Everest is more than 2,000 feet tall (low anchor). Non-farmers overwhelmingly chose “Agree strongly” (90%), while farmers’ responses were spread across all answer options.
Figure 5: Percent of participants who agreed or disagreed that Mt. Everest is more than 45,500 feet tall (high anchor). Non-farmers skewed to “Disagree strongly” (72%), while farmers peaked at “Agree a little,” with 66% agreeing overall.

Figures 4 and 5 suggest that, compared with farmers, standard MTurk workers either comprehend the questions better or pay more attention when responding.

Trolley Dilemma

Non-farmers’ responses to the trolley dilemma once again replicated established findings, while farmers’ did not. Farmers chose to save five people over one in both scenarios, while non-farmers made that choice only when they had to turn the train, not when they had to push a man in front of it (see Figures 6 and 7).

Figure 6: Percent of participants willing to turn the train in the classic trolley dilemma. Farmers and non-farmers responded very similarly, with about 80% of each group willing to turn the train.
Figure 7: Percent of participants willing to push a man onto the tracks in the footbridge trolley dilemma. Only 28% of non-farmers would push the man, replicating established findings, while 75% of farmers would, choosing the utilitarian response at rates similar to the classic dilemma.

Open-ended Response

We coded participants’ responses to the open-ended trolley questions (“Please describe the reasons for your response.”) for both conditions of the dilemma.

Figure 8: Quality of responses to the open-ended trolley question in the classic version. All non-farmers provided high-quality answers, while farmers’ responses were mixed: 46% high quality, 23% acceptable, and 31% junk.

For the classic version, all non-farmers provided high-quality responses while farmers were mixed: some provided good data, some provided acceptable data, and some provided junk (see Figure 8).

Figure 9: Quality of responses to the open-ended trolley question in the footbridge version. 96% of non-farmers provided high-quality answers, while farmers’ responses were again mixed: 44% high quality, 22% acceptable, 30% junk, and 4% no response.

Four percent of farmers in the footbridge version did not respond to the open-ended question. Almost all non-farmers provided high-quality responses while farmers had some good data, some acceptable data, and some junk data (see Figure 9).

reCAPTCHA and Honeypot

As was the case last time, all participants passed the reCAPTCHA and honeypot questions, suggesting that these participants are human beings rather than bots.

English Proficiency Screener

Nearly all non-farmers passed the English proficiency screener, while about 71% of farmers failed (see Figure 10).

Figure 10: Percent of participants who passed all four questions in the English proficiency screener. 98% of non-farmers passed, compared with only 29% of farmers.

Cultural Checks

For our cultural checks, we kept the questions we used last time. Note that these questions were open-ended: participants were not choosing from a set of response options.

Figure 11: Visual stimulus (a photograph of an eggplant) used for “What is the name of this vegetable?”

Figure 12: Percent of participants responding “Eggplant” vs “Brinjal” to the picture prompt.

All non-farmers identified the vegetable as an eggplant, while about half of the farmers identified it as brinjal (see Figure 12). All non-farmers passed all four cultural trials while 78% of farmers failed at least one of the four trials (see Figure 13).

Figure 13: Percent of participants who passed all four cultural trials. All non-farmers passed, while only 22% of farmers did.
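
Because the cultural trials were open-ended, scoring them comes down to normalizing free-text answers before matching them against a key. A small sketch; the answer key and function below are illustrative, not the coding scheme we used:

```python
def passes_trial(response: str, accepted: set[str]) -> bool:
    """Normalize a free-text answer and check it against an answer key."""
    return response.strip().lower().strip(".!") in accepted

# Assumed answer key for the vegetable trial shown in Figure 12:
vegetable_key = {"eggplant"}
print(passes_trial("Eggplant!", vegetable_key))   # True
print(passes_trial("brinjal", vegetable_key))     # False
```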

Overall Time on the Survey

Farmers took about eight minutes longer than non-farmers to complete the survey (see Figure 14). This is consistent with our conclusion last time that farmers are likely either taking many HITs at the same time or taking longer to respond to surveys written in English.

Figure 14: Time taken to finish the survey. Non-farmers finished in about 15 minutes on average, while farmers took about 23 minutes.

Summary

Overall, we see that activity from workers who use server farms has decreased. Some farmers remain active, however, and as a group they continue to provide mostly random responses. Using TurkPrime’s tools, we were able to block farmers from a standard study and obtained high-quality data as a result. Importantly, we did not run a control group without our tools, so some of the improvement may reflect an overall decrease in farmer activity on MTurk that these studies cannot separate from the effect of the tools. Taken together, our results suggest two things:

  1. It is possible to collect high quality data using TurkPrime’s tools.
  2. The most active workers who were providing bad data have reduced their activity on MTurk.

Moving forward

We are monitoring data quality on an ongoing basis, and in the next few weeks we will be sharing more results. We will also continue developing and refining tools to help you collect high-quality data on Mechanical Turk.


Citation

Litman, L., Robinson, J., Moss, A. J., & Gautam, R. (2018, October 29). Moving beyond bots: MTurk as a source of high quality data [Blog post]. Retrieved from /resources/blog/moving-beyond-bots-mturk-as-a-source-of-high-quality-data/
