An undercover team of computer scientists reveals the practices of people who are paid to post on websites.
In China, paid posters are known as
the Internet Water Army because they are ready and willing to ‘flood’
the internet for whoever is willing to pay. The flood can consist of
comments, gossip and information (or disinformation) and there seems to
be plenty of demand for this army’s services.
This is an insidious tide. Positive
recommendations can make a huge difference to a product’s sales but can
equally drive a competitor out of the market. When companies spend
millions launching new goods and services, it’s easy to understand why
they might want to use every tool at their disposal to achieve success.
The loser in all this is the
consumer who is conned into making a purchase decision based on false
premises. And for the moment, consumers have little legal redress or
even ways to spot the practice.
Today, Cheng Chen at the University
of Victoria in Canada and a few pals describe how Cheng worked
undercover as a paid poster on Chinese websites to understand how the
Internet Water Army works. He and his friends then used what he learnt
to create software that can spot paid posters automatically.
Paid posting is a well-managed
activity involving thousands of individuals and tens of thousands of
different online IDs. The posters are usually given a task to register
on a website and then to start generating content in the form of posts,
articles, links to websites and videos, even carrying out Q&A
sessions.
Often, this content is pre-prepared
or the posters receive detailed instructions on the type of things they
can say. And there is even a quality control team who check that the
posts meet a certain ‘quality’ threshold. A post is not validated
if it is deleted by the host or is composed of garbled words, for
example.
Having worked undercover to find out
how the system worked, Cheng and co then studied the pattern of posts
that appeared on a couple of big Chinese websites: Sina.com and
Sohu.com. In particular, they studied the comments on several news
stories about two companies that they suspected of paying posters and
that were involved in a public spat over each other’s services.
The Sina dataset consisted of over
500 users making more than 20,000 comments; the Sohu dataset involved
over 200 users and more than 1,000 comments.
Cheng and co went through all the
posts manually identifying those they believed were from paid posters
and then set about looking for patterns in their behaviour that can
differentiate them from legitimate users. (Just how accurate their
initial impressions were is a potential problem, they admit, but it is
the same one that spam filters have to deal with.)
They discovered that paid posters
tend to post more new comments than replies to other comments. They also
post more often, with 50 per cent of them posting every 2.5 minutes on
average. They also move on from a discussion more quickly than
legitimate users, discarding their IDs and never using them again.
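To make these behavioural signals concrete, here is a minimal sketch of how they might be computed per user. It is not the authors' code: the `(user_id, timestamp_seconds, is_reply)` record layout and the feature names are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def user_features(comments):
    """Compute simple per-user behaviour features.

    `comments` is an iterable of (user_id, timestamp_seconds, is_reply)
    tuples. This schema is hypothetical, not taken from the paper.
    """
    by_user = defaultdict(list)
    for user, ts, is_reply in comments:
        by_user[user].append((ts, is_reply))

    features = {}
    for user, items in by_user.items():
        items.sort()  # order each user's comments by timestamp
        times = [ts for ts, _ in items]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        features[user] = {
            # Share of the user's comments that reply to someone else.
            "reply_ratio": sum(1 for _, r in items if r) / len(items),
            # Average time between consecutive comments (None if only one).
            "mean_gap_seconds": mean(gaps) if gaps else None,
            # How long the ID stayed active in the discussion.
            "active_span_seconds": times[-1] - times[0],
        }
    return features
```

On this reading, a low reply ratio, a short gap between comments and a short active span would all point towards a paid poster, although any thresholds would have to be learnt from labelled data.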
What’s more, the content they post
is measurably different. These workers are paid by volume and so
often take shortcuts, cutting and pasting the same content many times.
This would normally invalidate their posts but only if it is spotted by
the quality control team.
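Cut-and-paste content of this kind can be flagged with a simple similarity measure. The sketch below uses Jaccard similarity over character bigrams, which is a stand-in chosen for illustration and not necessarily the measure used in the paper; the function names and the 0.8 threshold are likewise assumptions. Character bigrams are used rather than words because Chinese text is not split by spaces.

```python
def shingles(text, n=2):
    """Return the set of character n-grams in `text`, ignoring whitespace."""
    text = "".join(text.split())
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two strings' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def duplicate_ratio(posts, threshold=0.8):
    """Fraction of a user's posts that closely match one of their earlier posts.

    The threshold is an illustrative guess, not a value from the paper.
    """
    if not posts:
        return 0.0
    duplicates = 0
    for i, post in enumerate(posts):
        if any(jaccard(post, earlier) >= threshold for earlier in posts[:i]):
            duplicates += 1
    return duplicates / len(posts)
```

A user whose duplicate ratio is high is repeating themselves far more than an ordinary commenter would.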
So Cheng and co built some software
to look for repetitions and similarities in messages as well as the
other behaviours they’d identified. They then tested it on the datasets
they’d downloaded from Sina and Sohu and found it to be remarkably good,
with an accuracy of 88 per cent in spotting paid posters. “Our test
results with real-world datasets show a very
promising performance,” they say.
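For readers unsure what an accuracy figure like this measures, it is the share of users the software labels the same way as the manual labelling did. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the manually assigned labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one labelled example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# True means "paid poster"; these labels are invented for illustration.
predicted = [True, False, True, True]
actual = [True, False, False, True]
print(accuracy(predicted, actual))  # 0.75
```

Accuracy on its own can flatter a detector when one class heavily outnumbers the other, so it is usually read alongside how many genuine users were wrongly flagged.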
That’s an impressive piece of work
and a good first step towards combating this problem, although they’ll
need to test it on a much wider range of datasets. Nevertheless, these
guys have the basis of a software package that will weed out a
significant fraction of paid posters, provided these people conform to
the stereotype that Cheng and co have measured.
And therein lies the rub. As soon as
the first version of the software hits the market, paid posters will
learn to modify their behaviour in a way that games the system. What
Cheng and co have started is a cat and mouse game just like those that
plague the antivirus and spam filtering industries.
And that means the battle ahead with the Internet Water Army will be long and hard.
Ref:
arxiv.org/abs/1111.4297: Battling the Internet Water Army: Detection of Hidden Paid Posters