Last time we discussed how the Turing test, flawed as it is, still holds some relevance for AI development, especially when it comes to intelligent assistant applications. For more on that, see the previous post.
However, running the test consistently is a methodological problem, because it relies on psychological measurement. Because a human makes the judgement on whether the test subject is a machine or not, these judgements must be established as accurately as possible, objectively and without bias.
What follows are the proposed rules that every measurement of Turing time should satisfy.
Rule 1: No naïve tests
The judge must know they are taking part in a test. For example, someone running a chat bot might not tell visitors that a test is under way, and only later ask whether they noticed anything strange or suspected that they had been chatting to a machine. Such a test would be biased towards giving bots incorrectly long Turing times.
Rule 2: Be yourself
This holds for every human taking part in the tests – be it the assessor or the test subject. Every participant should be instructed not to imitate a machine in any way. They simply need to be themselves – to be human. At the end of the test, they should be rewarded for not being mistaken for a machine.
Rule 3: 50-50
The probability that a human judge is talking to a machine in any single test should remain at 50%. That is, there should always be a 50% chance that a test subject is human, and 50% that it is a machine. Also, the judge should know this likelihood.
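One way to picture this rule is a random draw made before every test. The following is only a minimal sketch: the function name assign_subject and the fixed seed are assumptions made for illustration, not part of any established protocol.

```python
import random


def assign_subject(rng: random.Random) -> str:
    """Decide, with equal probability, whether the judge meets a human or a machine."""
    return "machine" if rng.random() < 0.5 else "human"


# A fixed seed is used here only so that the sketch is reproducible.
rng = random.Random(2024)
assignments = [assign_subject(rng) for _ in range(10_000)]

# Over many tests the share of machine trials should sit close to 0.5.
machine_share = assignments.count("machine") / len(assignments)
print(f"share of machine trials: {machine_share:.3f}")
```

Drawing the assignment independently for each test, rather than simply alternating, keeps the judge from inferring the answer from the previous test.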
Rule 4: Average Joe
The judges shall represent the education and culture of the general population. The judges should by no means be experts in AI, or experts in human behaviour. Therefore, judges should be sampled from the general population.
Rule 5: Free choice of topics
There should be no limits to the topics discussed. This also means that no theme of conversation can be assigned, implicitly or explicitly. Similarly, there should be no limits to topic changes. For example, it should not be considered a proper Turing test if one is limited to conversations about medical issues.
Of course, it is legitimate to determine how long it takes for users of a medical AI to notice that they are not talking to a human physician, but that result cannot be called Turing time. The result may be called something like limited domain detection time. Alan Turing's imitation game asks whether a machine can replicate human intelligence, and this is what we are measuring with Turing time.
Rule 6: No time or word limit
We don't ask whether AI can be uncovered within 30 minutes, or after 1000 words have been exchanged. We are also not interested in how often the presence of a machine is revealed in a 30-minute test. The imitation game is by its nature asymmetric; a machine imitates a human, not the other way around. This is why the test ends the moment the assessor can make a confident judgement on which of the two test subjects is a machine.
Rule 7: Human faults
The machine is expected to exhibit human faults, and thus can be detected through super-human performance. For example, if the machine has encyclopaedic knowledge or can quickly perform complex calculations, the judge could legitimately conclude that this is a machine. Of course, there is nothing wrong with endowing AI with super-human abilities outside the imitation game, but here we are determining whether, and for how long, the machine can hold the illusion that it is human. A large part of humanity is fallibility.
Rule 8: Multiple tests, minimal time
In order to derive a robust Turing time measurement, multiple tests must be taken using multiple judges and human test subjects, with the same AI. The Turing time is defined as the minimum of all those times. We do not consider the average or median.
The Turing time is not about how long it takes on average for the machine to be detected, but how long the machine is guaranteed to hold the illusion.
The reasoning behind this rule is the same as that of two-year product warranties. If you purchase a car – or any other product – you don’t want it to work with some likelihood, you want it to work, period.
Similarly, if you get a new state-of-the-art AI that imitates human intelligence, you don’t want that imitation to maybe work for a bit – it must be guaranteed to work for a period of time. Good user experience – i.e. good, human-like interaction with AI – is about how well the technology performs for everyone, across all types of applications. This is why Turing time is defined as the minimal time for the test to end. This also means that once Turing time is measured, the result is not fixed – it can be reduced later to a lower value.
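The definition is simple enough to write down directly. The sketch below assumes detection times are recorded in hours; the function name turing_time and the sample values are hypothetical.

```python
def turing_time(detection_times_hours):
    """Return the Turing time: the shortest detection time across all tests."""
    if not detection_times_hours:
        raise ValueError("at least one completed test is required")
    return min(detection_times_hours)


# Hypothetical detection times (in hours) from four tests of the same AI.
times = [2.5, 0.75, 4.0, 1.25]
current = turing_time(times)

# A later test in which the machine is caught sooner lowers the reported value.
updated = turing_time(times + [0.5])

print(current, updated)
```

Because the minimum is used, adding more tests can only keep the value the same or reduce it, which is why a published Turing time is never final.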
Rule 9: Turing word count
An additional way of indicating the amount of interaction needed to detect that one is interacting with a machine is the number of words used in the communication. This number may be published along with Turing time, and can be referred to as the Turing word count.
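As a rough illustration, the count could be taken over the whole transcript up to the moment of detection. The sketch below rests on two assumptions that the rule itself does not settle: words from both sides of the conversation are counted, and a word is anything separated by whitespace. The function name turing_word_count is hypothetical.

```python
def turing_word_count(messages):
    """Count whitespace-separated words across all messages up to detection."""
    return sum(len(text.split()) for _speaker, text in messages)


# A hypothetical two-message transcript ending at the point of detection.
transcript = [
    ("judge", "I need to fly to London next week. Can you find a flight for me?"),
    ("subject", "I cannot find a flight to destination: London Next Week."),
]

word_count = turing_word_count(transcript)
print(word_count)
```

Whichever counting convention is chosen, it should be stated alongside the published number so that results remain comparable.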
Rule 10: Valid justification
Each detection of a machine-like error must be documented with an explanation as to why it was considered an error. One needs to document how the machine responded, and explain why this was not accepted as a human-like response. It is also good practice to state how a human could have communicated in that situation.
For example, if the assessor asked, “I need to fly to London next week. Can you find a flight for me?”, and the machine responded with, “I cannot find a flight to destination: London Next Week.”, then one would document that the machine had misunderstood the destination and time frame. A human would more likely ask for clarification, narrowing the time frame with a follow-up question such as, “Which day next week would work for you?”. Through proper documentation, anyone should be able to see how the Turing time was established.
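Such documentation could be kept as a small structured record. This is a sketch only: the class DetectionRecord, its field names and the elapsed time of 0.1 hours are assumptions for illustration, while the quoted exchange comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionRecord:
    """One documented machine-like error (field names are hypothetical)."""
    judge_prompt: str
    machine_response: str
    why_not_human: str
    plausible_human_response: str
    elapsed_hours: float

    def summary(self) -> str:
        """Return a one-line description for a published report."""
        return f"Detected after {self.elapsed_hours:g} h: {self.why_not_human}"


record = DetectionRecord(
    judge_prompt="I need to fly to London next week. Can you find a flight for me?",
    machine_response="I cannot find a flight to destination: London Next Week.",
    why_not_human="The machine misunderstood the destination and time frame.",
    plausible_human_response="Which day next week would work for you?",
    elapsed_hours=0.1,  # hypothetical value, not taken from a real test
)

print(record.summary())
```

Keeping the record immutable means that a justification cannot be quietly edited after the Turing time has been published.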
Rule 11: Intermittent conversation
The judge should be permitted to break the interaction, returning to continue the test at a later point in time. The Turing time should count from the very beginning of the entire interaction with that AI. As AI advances significantly in the future, such measurements of Turing time will become a necessity.
Rule 12: Under measurement
If a machine has not revealed itself for longer than 50 hours, its Turing time can be reported as “under measurement” until the machine finally makes a mistake. Imagine the year 2117: a new state-of-the-art AI is already five years old. The Turing tests began the moment the AI went live, but to date not a single machine-like error has been detected. The machine’s Turing time is reported as “under measurement”, with an indicator of how long it has been, e.g. “1540 hours without error”.
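The reporting logic might be sketched as follows. The function name report_turing_time and the "not yet reportable" status for machines that have run for 50 hours or less without a detection are assumptions, since the rule does not say what to report in that case.

```python
UNDER_MEASUREMENT_THRESHOLD_HOURS = 50  # threshold taken from this rule


def report_turing_time(detection_times_hours, hours_without_error):
    """Format a Turing time report for one AI.

    detection_times_hours: detection times from completed tests (may be empty).
    hours_without_error: total test time so far without a machine-like error.
    """
    if detection_times_hours:
        # At least one detection exists, so the minimum is the Turing time.
        return f"{min(detection_times_hours):g} hours"
    if hours_without_error > UNDER_MEASUREMENT_THRESHOLD_HOURS:
        return f"under measurement ({hours_without_error:g} hours without error)"
    # Assumption: below the threshold there is nothing to report yet.
    return "not yet reportable"


print(report_turing_time([], 1540))
print(report_turing_time([2.5, 0.75], 0))
```

Once the machine makes its first mistake, the "under measurement" status is replaced by an ordinary Turing time.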
Rule 13: Don’t alter your AI
During the period of testing, the AI should not be altered by a third party. The machine is allowed to learn, but only in the same way a human would learn. For example, the AI could obtain information about recent world events, as this could help it interact with the assessor. However, the underlying technology of the AI should not be improved whilst the AI itself is being assessed.
If the AI is altered significantly outside of a period of testing, that AI should be considered as new, and a separate set of tests should begin.
Of course, on top of these thirteen rules, all the other known methodological factors of scientific measurement and experimentation must be considered and applied. This includes proper randomization, sampling, double-blind designs, replicability of results, and so on.
What do you make of these rules? What did I miss? Tweet me to discuss your AI.
Danko Nikolic is a brain and mind scientist, as well as an AI practitioner and visionary. His work as a senior data scientist at Teradata focuses on helping customers with AI and data science problems. In his free time, he continues working on closing the mind-body explanatory gap, and using that knowledge to improve machine learning and artificial intelligence.