The following article provides an introduction to CheatSweep, with a focus on the various tags Survey Programmers (SPs) can implement for CheatSweep. For information on the CheatSweep reporting applet, see CheatSweep report.
For clients seeking additional information on what CheatSweep is and how it works, see CheatSweep overview for clients.
What is CheatSweep™?
CheatSweep is IntelliSurvey's patent-pending system designed to help keep survey data clean. As the famous cartoon says, "On the Internet, nobody knows you're a dog." But when survey data is used to make important decisions, you should try to keep the "dogs" out of your data. This is particularly important on surveys where respondents are paid to complete the survey, because then there is a larger incentive for cheaters. But even on non-paid surveys, you want to make sure that your data is accurate and not skewed by respondents that lie about their background or qualifications. CheatSweep can help by detecting potential data problems and automatically sweeping questionable responses to a "Fraud" (F) status.
CheatSweep uses a variety of metrics to answer several questions about respondents, including:
- Is this respondent paying attention?
- Is the respondent going too fast compared to others, or compared to a baseline?
- Do the respondent's IP address and ISP information indicate that he or she might not reside where we expect? Or could one person have started the survey and then handed it off to another respondent?
- Is this a duplicate respondent? That is, has this same person or computer already submitted another response to the survey?
IntelliSurvey uses a variety of metrics to answer each of these questions, including things such as IP address geo-location, browser fingerprinting, attention check questions, cookies, speed checks, and so on. By combining these metrics using a mathematical algorithm that we adjust over time, we can detect potential cheaters more thoroughly than with any single measure.
After computing these metrics, IntelliSurvey assigns a "cheating probability" from 0 to 100 for each respondent. The higher the score, the more likely that a particular response was tainted by cheating or inattention. You can set a threshold so that respondents over a certain score (for example, over 95) are automatically marked as F and thus not included in counts of survey completes.
How CheatSweep works
Every time a survey respondent submits a page of survey data, his or her browser also silently sends along other information, such as cookies set by the survey server, the IP address, and headers such as the name of the browser in use (the "user agent"). In addition, the IntelliSurvey software tracks other information, such as the elapsed time on a particular page and the browser ID (also known as a browser "fingerprint"). A survey can also include explicit attention check questions, and we can check for patterns in other responses (such as table straight-lining).
Often, none of these pieces of information will by itself give us an accurate picture of whether or not a respondent is cheating. But by combining these metrics using statistical and Bayesian techniques, we create a "CheatSweep score" for each respondent. This score ranges from 0 to 100, where '0' means there is no indication of cheating, and '100' means that we are fairly certain something is fishy. If a threshold is set and the respondent is above it, the response is automatically and immediately moved from Complete status (C) to F for Possible Fraud. Once a respondent has been swept to F, they will not be considered a Complete again, except in the limited cases described under Possible fraud records.
It is important to note that in order to have an accurate formula and avoid bias, CheatSweep includes data from several statuses such as C, F, B, J, and X, in its calculations, but these statuses are not swept (reclassified) by CheatSweep.
Note that while we use the combined data to catch the widest range of cheaters, there can also be obvious signals that don't require much data. For example, if the survey targets US consumers, and the IP indicates the response is from China, there is no need to wait for complete data – we can throw the respondent out immediately if desired. Similarly, each completed respondent receives a cookie from the survey server. If after completing one response, we receive another response that shows an identical cookie, there is no need to wait for additional data, and again we can toss the respondent immediately if desired.
Note: Surveys must be in Live mode to run CheatSweep. By default, CheatSweep will not begin sweeping until after a survey has 50 completed respondents. For more information on defining a starting point for CheatSweep, see below.
Using CheatSweep
The following sections contain information about common uses of CheatSweep. For information on setting custom CheatSweep rules, see Custom CheatSweep Rules.
To enable CheatSweep in any survey, simply add the following code.
enable cheatsweep
This line activates the core tracking and data gathering features of CheatSweep. It does not, however, automatically send respondents to status F as-is. There are two methods for reclassification - setting a threshold or setting a removal target percent.
Note: In order to have access to the CheatSweep Report applet, the survey must contain enable cheatsweep in the published version of the survey. Without that code, the applet is hidden from users.
Users must have a survey role of Survey Owner, Manager, or Editor to view the CheatSweep Report, or be assigned the add-on role of CheatSweep Viewer. Only Survey Owners, Managers, and Editors can assign other users the role of CheatSweep Viewer.
For more information on using the CheatSweep applet, see CheatSweep report.
Tags
The following tags can be used with the CheatSweep widget. For more information, see the sections that follow.
Tag | Description | Example |
---|---|---|
threshold | Sets a minimum CheatSweep probability (csp) score for considering a respondent as possible fraud. Input may be an integer or decimal between 1 and 99, inclusive (e.g., '90' indicates anyone with a 90% or higher chance of being fraudulent should be removed). | threshold: 45 |
remove | Sets a percentage for which respondents should be removed from the data. An input of 'n%' indicates the highest n% of csp scores should be removed. | remove: 5% |
start_sweep | Sets a limit to the number of respondents that should be allowed into the survey before CheatSweep starts sweeping the data. By default, CheatSweep will begin analyzing data after 50 respondents. | start_sweep: 100 |
group_by | Allows Survey Programmers (SPs) to specify a pre-defined variable to use to sweep respondents within the groups they have been assigned, as opposed to sweeping all respondent data together. Sweeping data this way prevents one group from possibly skewing the data for all respondents. | group_by: QPANEL |
repeat_sweep | Accepts 'y' (yes) and 'n' (no) inputs; setting repeat_sweep to 'n' prevents earlier scored respondents from being reclassified (from F to C, or C to F) due to later results. | repeat_sweep: n |
allow_country | Allows users to specify a country (or countries) for which IP addresses are allowed for respondents; respondents not matching that country will be immediately termed. Requires country codes for input. Multiple countries may be specified in a comma-separated list. | allow_country: US,GB,CA,BR |
disallow_country | Allows users to specify a country (or countries) for which IP addresses are not allowed for respondents; respondents matching the specified country codes are immediately termed. Requires country codes for input. Multiple countries may be specified in a comma-separated list. | disallow_country: CN,KR |
expected_country | Allows users to specify a country (or countries) for which IP addresses are expected for respondents. Unlike allow_country, respondents not matching the specified country will just be flagged and not removed. Requires country codes for input. Multiple countries may be specified in a comma-separated list. | expected_country: US,GB |
allow_dup | Accepts 'y' (yes) and 'n' (no) inputs, or condition logic; setting to 'y' allows multiple responses from a single person. By default, people who attempt multiple responses will be termed and coded as "duplicate." | allow_dup: y |
termdup | Accepts 'y' (yes) and 'n' (no) inputs; determines if a duplicate respondent should be classified as a termination (status T) and sent to the "term" end group, or classified as a duplicate (status D) and sent to the "dupe" end group. Using termdup: y is the same as the default or not using the termdup tag at all. Adding termdup: n will alter the default and give duplicate respondents the Duplicate status. | termdup: n |
csentrycalc (alias cheatsweep entry calc) | Accepts 'y' (yes) and 'n' (no) inputs; can be added to individual variables, or groups of variables, to allow panel-specific variables to still be run/punched before a Duplicate respondent is sent out of the survey. | csentrycalc: y |
seeking_value | Allows users to include condition logic referencing a specific question (or series of questions) to flag respondents who might be attempting to qualify for a survey when they shouldn't. | seeking_value: countChecked($Q1) |
Setting a threshold
To set a threshold, include the threshold tag.
enable cheatsweep
  threshold: 55
In the above example, the '55' indicates that you want to remove any record with a 55% or greater chance of being fraudulent. The csp (CheatSweep probability) field in the data, which will appear when enable cheatsweep is added to the survey code, will contain a value between zero and one, which is each individual record's decimal equivalent of that percentage. Thus, a value of .9 in the csp field for a given record indicates that there is a 90% chance that it is fraudulent. The higher the number specified in the threshold tag, the higher the cheater probability tolerance.
The overall quality of the data should be considered when choosing the threshold. In the case of high quality samples, you probably want to throw out only a few records, or none at all. Thus, a probability tolerance could be quite high, 90% or greater, depending on the sample size. In the opposite scenario, with a lot of bad data, you would likely want to accept less risk, and sweep any record with 60% chance of being fraudulent to status F. Thresholds should be considered thoughtfully for best results.
Tip: The threshold tag can accept decimal values.
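The relationship between the threshold tag (a percentage) and the csp field (a decimal) can be sketched in Python. This is only an illustration of the comparison described above, not IntelliSurvey's implementation; the function name should_sweep is hypothetical.

```python
def should_sweep(csp: float, threshold_percent: float) -> bool:
    """Return True when a record's csp meets or exceeds the threshold.

    csp is the decimal probability stored in the data (0.0 to 1.0).
    threshold_percent is the value given to the threshold tag (1 to 99).
    Hypothetical helper, for illustration only.
    """
    if not 1 <= threshold_percent <= 99:
        raise ValueError("threshold must be between 1 and 99, inclusive")
    # Convert the percentage to a decimal before comparing with csp.
    return csp >= threshold_percent / 100


if __name__ == "__main__":
    # With threshold: 55, a record with csp 0.9 is swept; one with 0.5 is kept.
    print(should_sweep(0.9, 55))
    print(should_sweep(0.5, 55))
```

A decimal threshold such as 45.5 works the same way in this sketch, since the value is simply divided by 100.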
Setting a removal target percent
Some situations might require picking an arbitrary portion of survey sample to remove with CheatSweep. This is an imperfect solution: to arbitrarily select a percentage is analogous to asking your spam filter to remove a set percentage of your emails. You wouldn't know whether to set it at 5% or 50%, because it all depends on how many spam email messages you get. Nonetheless, this methodology persists. To accommodate, simply include the following code.
enable cheatsweep
  remove: 5%
This will remove the worst (highest) 5% of csp scores.
Note that as new records come in, the old records are not rescored, so their csp values will be stable. However, new records can and do affect the csp_percentile of old records. Each time CheatSweep runs, it recalculates the csp_percentile for all records, even old records. Thus even though the csp value will be stable, the csp_percentile can change. For example, if a large number of clean new records come in, they might make the older records look bad, and the csp_percentile values for the old records may increase. However, unless the threshold changes, and subject to the restrictions of the repeat_sweep tag (see below), these older records will not be changed from C to F. This means that we might remove less than the target percent; e.g., using remove: 5% may only remove 4% of respondents. This is relatively unusual, particularly if records arrive in a random fashion with respect to their cheat scores, but it could happen, so survey programmers should be aware that the removal percentage is not always an exact measure of the percentage to be removed.
Removing by csp will tend to be more stable, so if you used threshold: 90, then all records with a csp score of 0.9 or higher would be removed. The CheatSweep process will not examine csp_percentile in this case, so the fact that it can vary over time makes no difference.
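As a rough sketch of what a removal target means, the Python below picks the highest n% of csp scores from a list. The function name and the choice to round the count down are assumptions made for illustration; the actual CheatSweep process works from csp_percentile and may differ.

```python
import math
from typing import List, Sequence


def records_to_remove(csp_scores: Sequence[float], remove_percent: float) -> List[int]:
    """Return the indices of the records holding the highest csp scores.

    Hypothetical helper: the count is rounded down, which is an assumption.
    """
    count = math.floor(len(csp_scores) * remove_percent / 100)
    # Rank record indices from the highest csp score to the lowest.
    ranked = sorted(range(len(csp_scores)), key=lambda i: csp_scores[i], reverse=True)
    return sorted(ranked[:count])


if __name__ == "__main__":
    scores = [i / 20 for i in range(20)]  # 20 records with scores 0.0 to 0.95
    print(records_to_remove(scores, 5))   # the single worst record
    print(records_to_remove(scores, 10))  # the two worst records
```

Because the cut-off depends on the other records, the same record can fall inside or outside the top n% as new data arrives, which is the instability described above.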
Possible fraud records
Once a respondent has been swept to status F for being over the threshold or removal target percentage, they will be considered status C again only when:
- The cs_rescore field has been set to '1', which must be done manually via Add Data.
- The record was originally scored when there was not enough timing data to use as a baseline. The first 50 records still receive scores, but without timing baselines, the scores are not as accurate. Therefore, once there are more than 50 records, the original 50 are rescored. At that point, those 50 records can go from F back to C.
Tip: CheatSweep re-runs automatically every 30 minutes when there are new completed records.
Setting a starting point
By default, CheatSweep starts using rules that require percentiles only after at least 50 completed responses have been recorded. Using the start_sweep tag permits the user to define a different starting threshold.
enable cheatsweep
  start_sweep: 500
Using the code above would trigger CheatSweep to begin sweeping once 500 completes have been recorded.
Scoring by group
Occasionally surveys may require that CheatSweep not score all respondents together, and that instead, they are scored within defined groups. This need may occur for a multitude of reasons - e.g., one country's results are skewing the overall results; sampling is occurring via multiple methodologies (phone, online, paper surveys); etc. If there is concern that a group may affect the survey's overall results, the group_by tag may be added to the enable cheatsweep widget to tell CheatSweep to score within each individual group instead of combining all respondents together.
To use group_by, first create a variable to define the various groups that respondents will be assigned to (e.g., QPANEL or QCOUNTRY). After the variable has been defined, simply use group_by as follows:
enable cheatsweep
  group_by: QVARIABLE_NAME
Once group_by has been defined, respondents will be assigned to their groups and the start_sweep, percentiles, and removal percent will only apply within each group.
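To illustrate how a starting point applies per group rather than to the whole sample, here is a small Python sketch. The function name and the "at least start_sweep completes" comparison are assumptions made for illustration, not the CheatSweep implementation.

```python
from collections import Counter
from typing import Iterable, Set


def groups_ready_to_sweep(group_labels: Iterable[str], start_sweep: int = 50) -> Set[str]:
    """Return the groups that have reached the start_sweep count of completes.

    group_labels holds one group value (e.g., a QPANEL answer) per completed
    respondent. Hypothetical helper, for illustration only.
    """
    counts = Counter(group_labels)
    return {label for label, n in counts.items() if n >= start_sweep}


if __name__ == "__main__":
    labels = ["PanelA"] * 60 + ["PanelB"] * 10
    # Only PanelA has enough completes; PanelB is not yet swept.
    print(groups_ready_to_sweep(labels))
```

In this sketch a small group simply waits until it has enough completes of its own, regardless of how many completes the survey has overall.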
Repeat sweeping
CheatSweep continues to score in real time as responses filter in, and as such, respondents close to the threshold may be redefined from C to F if they reach this tipping point. For example, consider a situation where respondent #99 was just below the threshold of being counted as fraud for the first 100 respondents. Respondent #101 comes in, is not anywhere close to the threshold, and therefore pushes respondent #99 right over it. CheatSweep would then reclassify respondent #99 from C to F. As stated above under Possible Fraud Records, respondents can go back from F to C in certain situations as well.
Users can counteract reclassification of previously scored respondents by including repeat_sweep: n in their survey.
enable cheatsweep
  repeat_sweep: n
The repeat_sweep tag defaults to y, which permits respondents to be swept again to a new status as more data comes in. Setting repeat_sweep: n disables this feature and prevents respondents from being swept to a new status after the initial sweep.
Note that if the survey's removal target is increased (e.g., from remove: 5% to remove: 10%) and repeat_sweep: n has been implemented, CheatSweep will skip records that have previously passed, so the number of removed records may be less than the target percentage. In the example here, the actual removal percentage would likely be less than 10%, since some records in the 5-10% range that passed the initial sweep would not be re-swept, and thus would not be reclassified as F.
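The effect of repeat_sweep on a Complete record can be sketched as a small decision function. This is a simplified illustration that only covers the C to F direction; the function and parameter names are hypothetical, and the real sweep applies additional rules.

```python
def sweep_status(current: str, already_swept: bool, over_threshold: bool,
                 repeat_sweep: bool = True) -> str:
    """Return the status of a record after one sweep (simplified sketch).

    current        -- the record's status before this sweep ("C", "F", ...)
    already_swept  -- True if the record passed through an earlier sweep
    over_threshold -- True if the record is now over the threshold or target
    repeat_sweep   -- mirrors the repeat_sweep tag (True for 'y', False for 'n')
    """
    if current != "C":
        # Only Completes are candidates for sweeping in this sketch.
        return current
    if already_swept and not repeat_sweep:
        # repeat_sweep: n skips records that were already swept once.
        return current
    return "F" if over_threshold else "C"


if __name__ == "__main__":
    # An older Complete that has drifted over the threshold:
    print(sweep_status("C", already_swept=True, over_threshold=True))
    print(sweep_status("C", already_swept=True, over_threshold=True, repeat_sweep=False))
```

With the default, the older record is moved to F; with repeat_sweep disabled, it keeps its original status.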
Setting country options
For all IntelliSurvey surveys, the survey engine automatically stores the internet address (IP address) of each data submission to the server. IP addresses are managed centrally and handed out in blocks, which can then be associated with particular locations and countries. IntelliSurvey subscribes to a service (see maxmind.com) that maintains a database mapping IP addresses to their associated data. Thus every response will have several related data fields appended to it, such as the raw IP address, the country code, the longitude and latitude, and the ISP associated with the IP address. This data is gathered even when CheatSweep is not active.
When CheatSweep is active, you can set a list of allowed countries, blocked countries, or expected countries. Following are brief examples of each.
Allowed countries
Suppose that your survey is only relevant for residents of Canada. You can enable CheatSweep and block all IP addresses that are not associated with Canada like this.
enable cheatsweep
  allow_country: CA
Note that we use "CA" because that is the country code for Canada. A complete list of country codes can be found here: https://www.geonames.org/countries/
When any other country is detected by the IP address, the respondent will immediately be sent to the "term" page of the survey (and one will be automatically generated if no term page exists in your survey). Furthermore, the disp question will be flagged with "wrong country" for accounting purposes. This check happens before the first page is viewed and on every page submission. Therefore, a respondent from the wrong country will not even see the first page of the survey. If the respondent starts in the correct country and then hands off to another person with a non-allowed IP address, that person will be sent to the termination page of the survey as soon as he or she submits a page from the offending IP address.
If more than one country should be allowed, simply list them in a comma-delimited list, e.g., allow_country: CA, US.
Disallowed countries
Sometimes you might want to allow responses from a wide range of countries, yet still block certain other countries. For example, suppose that you don't mind where the responses come from, but you definitely don't want any responses from China or India. In that case, you could use the following.
enable cheatsweep
  disallow_country: CN, IN
Again, this check will happen before the first page is shown, and with every page submit and display. Any respondents with an IP address from a blocked country will be immediately sent to the terminate page of the survey. As with the allow_country check, any terminated responses will also have "wrong country" set for the disp field.
Expected countries
As noted above, the allow_country and disallow_country checks are immediate. This is often helpful, but suppose that you want to accept responses from anywhere, but still have some flag in the response to indicate that it came from an unexpected country. In this case you could use, e.g., expected_country: US. Respondents from non-US countries will then be allowed to continue the survey, but their responses will be flagged as coming from an unexpected country and will have the cheating score increased accordingly. The response can then be swept into the F status, depending on the threshold and other rules that matched. See below for more explanation of the sweeping process.
enable cheatsweep
  expected_country: US
This creates a new CheatSweep field, "unexpected country". Respondents from non-US countries will have a flag in this field.
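The three country options differ only in what happens on a mismatch, which the following Python sketch tries to make explicit. The function name, the return labels, and the order in which the options are checked are assumptions for illustration; the country code is presumed to come from the IP lookup described earlier.

```python
from typing import Iterable


def country_action(ip_country: str,
                   allow: Iterable[str] = (),
                   disallow: Iterable[str] = (),
                   expected: Iterable[str] = ()) -> str:
    """Return 'terminate', 'flag', or 'continue' for a respondent's country.

    Hypothetical helper mirroring allow_country, disallow_country, and
    expected_country; empty settings are treated as "not configured".
    """
    allow, disallow, expected = set(allow), set(disallow), set(expected)
    if allow and ip_country not in allow:
        return "terminate"  # immediate term, disp flagged "wrong country"
    if ip_country in disallow:
        return "terminate"
    if expected and ip_country not in expected:
        return "flag"       # respondent continues; score is increased
    return "continue"


if __name__ == "__main__":
    print(country_action("US", allow=["CA"]))         # terminate
    print(country_action("CN", disallow=["CN", "IN"]))  # terminate
    print(country_action("GB", expected=["US"]))      # flag
    print(country_action("CA", allow=["CA", "US"]))   # continue
```

In practice the check described above runs on every page submission, so a function like this would be evaluated repeatedly during a single response.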
Duplicate checking
Another category of survey fraud is committed when a respondent attempts to complete the same survey more than once. The CheatSweep system uses several mechanisms to detect this, but the simplest indicator is with browser cookies. Whenever a respondent takes a survey, the IntelliSurvey system sends a cookie to the browser, which is then sent back by the browser with each data submission. Then at the start of any new response, the system checks for the use of a cookie that was already used to complete the survey. If found, then by default the CheatSweep system immediately terminates the duplicate response, and sets its disposition to "duplicate".
This feature is automatically enabled whenever CheatSweep is enabled, so no configuration is required. Test responses are automatically excluded from this check (although a message will be included with test notes in this case).
There may be times, however, when you want to allow duplicates. For example, suppose that you release a "diary" type survey that requests more than one response from a single person. In this case, you can still enable CheatSweep, but use the allow_dup flag as shown below.
enable cheatsweep
  allow_dup: y
Sometimes only certain respondents should be allowed to bypass duplicate checking (e.g., a CATI call center). See Custom CheatSweep Rules for how to do this.
Duplicate status and routing
By default, the system treats duplicates as terminations. This means that when a duplicate respondent is detected upon entering a survey, they're immediately sent to the "term" end group, their status is set to T, and they're redirected back to the panel, if applicable, using the termination status redirect URL for that panel. Panels and clients often want to treat duplicates differently from terminations. This means assigning a different status and also using a different redirect so that panels can treat those respondents differently. To handle this, the termdup tag was added to CheatSweep.
When set to 'n', the termdup tag alters the path of the duplicate respondent. Instead of routing them to the "term" end group and setting the status to T, it routes them to the "dupe" end group and sets the status to D (Duplicate). This allows a different redirect to be associated with the duplicate status, so panels can more easily differentiate between terminated and duplicate respondents.
enable cheatsweep
  termdup: n
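The cookie check and the two routing outcomes can be summarized in a short Python sketch. The function name and the representation of cookies as strings in a set are assumptions for illustration only.

```python
from typing import Optional, Set, Tuple


def route_duplicate(cookie: str, completed_cookies: Set[str],
                    allow_dup: bool = False,
                    termdup: bool = True) -> Optional[Tuple[str, str]]:
    """Return (end_group, status) for a duplicate, or None if not a duplicate.

    Hypothetical helper: allow_dup and termdup mirror the tags of the same
    name, with True standing for 'y' and False for 'n'.
    """
    if allow_dup or cookie not in completed_cookies:
        return None  # the respondent proceeds into the survey
    # Default (termdup: y) treats duplicates as terminations.
    return ("term", "T") if termdup else ("dupe", "D")


if __name__ == "__main__":
    seen = {"abc123"}
    print(route_duplicate("abc123", seen))                 # ('term', 'T')
    print(route_duplicate("abc123", seen, termdup=False))  # ('dupe', 'D')
    print(route_duplicate("abc123", seen, allow_dup=True)) # None
```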
Punching variables for duplicates
Duplicate checking occurs at the very start of the survey. Because of this, none of the survey's questions or variables will punch before the respondent is sent out of the survey. This is fine for questions and other survey variables within the area of the actual survey, but is troublesome for panel-specific variables that are needed to populate values within the redirect URLs. To allow for this, the csentrycalc, or cheatsweep entry calc, tag can be added to individual variables or groups of variables. By adding this tag, those variables will run/punch before the duplicate respondent is sent out of the survey, where the end pages or redirects are then shown.
PANELVAR. Panel variable
  type: text
  invisible: y
  csentrycalc: y
  cvalue: {'123999'}
Other types of duplicate checking
If CheatSweep detects two responses using the same browser cookie, then we can be sure they are from the same browser and therefore are duplicates. However, cookies can be turned off or erased, or a respondent can simply use a different browser when attempting to submit a duplicate response. For this reason, CheatSweep also includes other checks, such as looking for duplicate browser IDs and IP addresses. These checks are included in the overall CheatSweep score, and are not grounds for immediate termination, however. See below for more details on how the CheatSweep scoring works.
Seeking
Respondents that try too hard to qualify for a survey tend to be bad respondents. These may be people who attempt to qualify for surveys just for an incentive and haven't really used the products or had the experiences they claim to have. We can often find these respondents by asking simple screening questions like this.
X. Have you purchased any of the following products in the last 3 months? Please check all that apply.
  type: checkbox
  1. Washing machine
  2. Lawnmower
  3. Helicopter
  4. Harmonica
  5. Car
Somebody who checks everything here is suspect. To include a seeking value in the csp calculation with a similar question, add the following.
enable cheatsweep
  seeking_value: countChecked($QX)
Since countChecked($QX) will return the number of options selected in QX, respondents selecting more options will have a higher seeking score.
In other surveys, seeking behavior may be determined in other ways, such as answers to open-ended questions or a combination of answers to several questions. In order to allow flexibility to accommodate different types of surveys, the seeking_value tag allows an expression.
enable cheatsweep
  seeking_value: countChecked($QX) + (anyChecked($QX,3)*10)
This gives a higher weight within the seeking calculation to respondents indicating that they purchased a helicopter in the last 3 months.
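To see what the weighted expression evaluates to, the arithmetic can be mimicked in Python. The functions count_checked and any_checked below are stand-ins written for this example, and they assume that anyChecked yields 1 when the option is selected and 0 otherwise.

```python
from typing import Set


def count_checked(selected: Set[int]) -> int:
    """Stand-in for countChecked: the number of options selected."""
    return len(selected)


def any_checked(selected: Set[int], *options: int) -> int:
    """Stand-in for anyChecked: 1 if any listed option is selected, else 0."""
    return int(any(option in selected for option in options))


def seeking_value(selected: Set[int]) -> int:
    """Mirror of countChecked($QX) + (anyChecked($QX,3)*10)."""
    return count_checked(selected) + any_checked(selected, 3) * 10


if __name__ == "__main__":
    print(seeking_value({1, 5}))           # washing machine and car: 2
    print(seeking_value({3}))              # helicopter only: 11
    print(seeking_value({1, 2, 3, 4, 5}))  # everything checked: 15
```

Under these assumptions, a single helicopter selection outweighs checking every other option, which is the intent of the extra weight.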
CheatSweep scoring
As described above, the CheatSweep system can immediately terminate responses that are from the wrong country or that are obvious duplicates. Because these are strong signals of cheating, no further data gathering is necessary. In many cases, however, cheating is more subtle. Therefore, we use a number of data points to determine the likelihood of cheating. Each indicator by itself may not be sufficient to determine whether or not a respondent is "cheating", but in combination the factors can reliably make a prediction. CheatSweep uses formulas derived from Bayes' Theorem to combine the estimated probabilities, in much the same way that spam detection algorithms operate. See http://en.wikipedia.org/wiki/Bayesian_spam_filtering for background.
For example, suppose that a respondent takes the survey more quickly than most others, and is in the fastest 10% as measured by average seconds per question answered. That by itself does not mean the respondent is cheating. After all, some respondents (10% in fact) will be in the fastest 10%, but this does increase the odds that the respondent is going too fast. Now further suppose that the same respondent used three different IP addresses when completing the survey. Again, this is not an absolute indicator of cheating, but it should also increase the likelihood that the person is not a reliable respondent. The CheatSweep methodology combines a series of rules like this, and uses it to predict an overall probability that the response is worthwhile. By combining factors, we calculate a more valuable and reliable metric than we could by using one measurement alone.
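As a sketch of how individual indicator probabilities can be combined, the Python below uses the standard combining formula from Bayesian spam filtering, p = (p1 × ... × pn) / (p1 × ... × pn + (1 − p1) × ... × (1 − pn)). This is the textbook formula under an independence assumption, not necessarily the formula CheatSweep uses, and the per-indicator probabilities in the example are made up.

```python
import math
from typing import Sequence


def combine_probabilities(probs: Sequence[float]) -> float:
    """Combine per-indicator cheating probabilities into one probability.

    Textbook naive-Bayes combination; assumes the indicators are independent.
    """
    prod_p = math.prod(probs)
    prod_q = math.prod(1 - p for p in probs)
    if prod_p + prod_q == 0:
        raise ValueError("indicators are contradictory (both 0 and 1 present)")
    return prod_p / (prod_p + prod_q)


if __name__ == "__main__":
    # Hypothetical values: speed suggests 0.6, multiple IP addresses suggest 0.7.
    combined = combine_probabilities([0.6, 0.7])
    print(round(combined, 2))
```

With these illustrative inputs the combined value is about 0.78, higher than either indicator alone, which matches the intuition that two weak signals together are stronger evidence.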
Calculated variables
When CheatSweep is enabled, a number of variables are added to the survey dataset. These variables are not meant to be used directly, and are here only for reference. They are used while calculating the CheatSweep score.
Most of these variables can be found under Cheatsweep fields in the Reporting Field Tree, though tvar_n, tvar_final, and any of the individual table variance variables (e.g., VAR_T1) can be found under the Table variance folder (a sub-folder of Cheatsweep fields).
Variable Name | Description |
---|---|
csp | Cheatsweep calculated cheating probability. |
cs_bad | Cheatsweep reasons. |
cs_good | Cheatsweep mitigating factors. |
cs_percentile | Percentile rank for cheating probability. |
attention_pass_count | Number of attention check question passes. |
attention_fail_count | Number of attention check question failures. |
seconds_per_answer | Elapsed seconds per question answered. |
answer_count | Count of questions answered. |
browser_id | Browser ID (pseudo-fingerprint). |
 | Number of IP addresses used. |
 | Number of countries by IP address. |
 | Number of ISPs by IP address. |
 | Maximum distance between any two IP addresses. |
 | Number of isid cookies used. |
 | Number of devices used. |
 | Number of other responses using an IP address also used here. |
 | Number of other responses using an isid also used here. |
 | Count of user errors. |
 | User error rate. |
key_count | Count of key presses. |
click_count | Count of click/touch events. |
spa_percentile | Percentile rank for seconds per answer. |
open_char_percentile | Percentile rank for characters in open-ended questions. |
open_char_word_count | Count of words found in open-ended questions. |
open_char_token_count | Count of non-punctuation strings found in open-ended questions. |
open_char_stoplist_count | Count of stoplist words found in open-ended questions. |
open_char_count | Count of characters found in open-ended questions. |
tvar_percentile | Percentile rank for table variance. |
cs_audit | Cheatsweep audit trail. |
cs_rescore | Cheatsweep rescore flag. |
VAR_Tx | Answer selection variance for a table 'x' (there are often several variables of this nature, one for each table, found under the Table variance sub-folder). |
tvar_n | Total N for combined table variance (found under the Table variance sub-folder). |
tvar_final | Weighted combined table variance statistic (found under the Table variance sub-folder). |
Initial timing and statistical data
As described above, CheatSweep uses statistical measures to compare respondents. As with any statistical measure, a certain set of baseline data is required to create statistics. Therefore, although CheatSweep runs immediately from the onset of the survey, it will have incomplete data for the first 50 respondents. To accommodate this, CheatSweep will operate in the background and then begin to apply the rules that compare the records to each other once there are 50 completes. After the first 50 completed responses have been recorded, CheatSweep goes back and rescores the initial responses. As appropriate, the system will then flag respondents with scores over the threshold and move them to status F.
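The need for baseline data can be illustrated with a percentile rank that is only available once enough completes exist. The function name, the percentile definition (share of baseline values strictly below the respondent's value), and returning None before the baseline is reached are all assumptions made for this sketch.

```python
from typing import Optional, Sequence


def percentile_rank(value: float, baseline: Sequence[float],
                    min_baseline: int = 50) -> Optional[float]:
    """Return the percentile rank of value within baseline, or None.

    None means there are not yet enough completes to compare against.
    Hypothetical helper; CheatSweep's own percentile rules may differ.
    """
    if len(baseline) < min_baseline:
        return None
    below = sum(1 for b in baseline if b < value)
    return 100 * below / len(baseline)


if __name__ == "__main__":
    # Seconds per answer for 100 earlier completes: 1, 2, ..., 100.
    baseline = list(range(1, 101))
    print(percentile_rank(11, baseline))   # 10 values are faster
    print(percentile_rank(5, [1, 2, 3]))   # too little data for a baseline
```

Once the baseline is reached, earlier records that returned None in a sketch like this could be scored again, which corresponds to the rescoring of the initial responses described above.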
Data cleaning variables
Hand in hand with CheatSweep's automated data cleaning process, Project Consultants and Clients have to manually review respondent data and reclassify respondents from C (Complete) to another final status. Since status F (potential Fraud) is reserved for CheatSweep's automated process, the person reclassifying the data should use status X (Expunged). We have two standardized fields, clean_status and remove_reason, that can be used for this cleaning process. It is preferred to use these standardized fields in lieu of creating manual ones so that there is consistency across all projects.
The fields clean_status and remove_reason can be found in the CheatSweep Fields chapter of the Field Selector. They are auto-generated on each survey so long as the CheatSweep widget is enabled. The variable clean_status is a single-punch variable. It is designed to be blank by default and not store any value unless manually updated via the Add Data applet or the Batch Update applet. The remove_reason variable is a text field which can store the reason for the respondent's removal, e.g., "Nonsensical words". Note that the Batch Update applet can only update closed-ended, single-punch variables. Therefore, the most common approach for updating both fields simultaneously is the Add Data applet.
The clean_status and remove_reason fields are defined as follows:
clean_status. Data cleaning status
  type: radio
  0. Not reviewed
  1. Reviewed and ok
remove_reason. Data cleaning removal reason(s)
  type: textbox