Sinking A Mine Shaft Into Data
Procedure Reveals Hidden Patterns
December 26, 2003
By ROBERT A. FRAHM, Courant Staff Writer
NEW BRITAIN -- As businesses, government, universities and
others churn out a relentless and growing flood of information, you never know
what gem might be hidden in the data.
A promising sales plan, perhaps? A winning NBA game strategy? A terrorist plot?
Whatever it is, Daniel T. Larose can help you find it.
Larose, a statistics professor at Central Connecticut State University, is
training a new generation of specialists to tap into vast amounts of data using
high-powered computers and a promising technique known as data mining.
Larose's popular Internet-based program - the world's only online master's
degree program in data mining, according to Central - attracts students from all
over the United States and several foreign countries. Larose introduces them to
statistical models that can extract information from databases too massive for
conventional analysis.
Data mining - the use of technology to recognize obscure or hidden patterns in
giant masses of data - has been in the news lately, mostly as a tool to track
terrorists. But Larose says the technique could benefit everyone from
booksellers studying readers' buying habits to pharmaceutical companies looking
for a cure for cancer.
"It's going to be everywhere," said Larose.
"We are ... inundated with data in most fields," he writes in a draft
of a book about data mining. "The problem is that there are not enough
trained human analysts available who are skilled at translating all of this data
into knowledge."
The field has intrigued scientists and statisticians trying to make sense of the
information explosion fueled by the Internet and increasingly powerful
technology over the past decade.
"Scientific satellites send terabytes of data every day. Some send
terabytes of data every minute," said Gregory Piatetsky-Shapiro, editor of
KDnuggets.com and KDnuggets News, a Boston-based website and newsletter about
data mining. A terabyte, he says, is roughly equivalent to the data in a million
Yellow Book phone directories.
"There is all this information - how to make sense of it?" he said.
At Central, Larose began teaching the online courses two years ago, the same
year that Technology Review, a magazine affiliated with the Massachusetts
Institute of Technology, listed data mining as one of 10 new technologies that
will change the world. The emerging field gained new attention after the Sept.
11 attacks because of its potential for spotting new terrorist plots.
Using data mining software, investigators could, for example, sort through debit
and credit card records, flight school rosters, airline ticket bookings,
immigration records and personal data such as age and ethnicity to spot obscure
patterns leading to potential terrorists or suspicious activities.
Critics fear that such powerful snooping could lead to abuses and violations of
privacy rights. Earlier this year, Congress canceled a Pentagon anti-terror
project called Total Information Awareness out of concerns that the data mining
project might be used to spy on ordinary citizens.
"It's not that data mining is in itself bad," Larose says, "but
it can be put to pernicious uses, just like any other tool."
A more benign use of data mining, Larose says, is in professional sports.
At every NBA game, for example, statisticians track shots, rebounds, misses,
assists and fouls, noting such things as the location of players on the floor,
player substitutions and the time clock.
"Basketball definitely lends itself to data mining because there is so much
data," said Jay Wessel, senior director of technology for the Boston
Celtics.
The Celtics use software to analyze trends from statistics gathered at thousands
of NBA games to develop game strategies, such as how to defend a specific player
or what combination of players to use against a particular lineup by an
opponent, he said.
"Some of the same code developed for banking we use to look for data,"
he said.
In Larose's online classes, you can learn about "neural networks" and
"classification and regression trees" - some of the techniques used to
analyze huge sets of data.
At $425 per credit, data mining courses are Central's most expensive online
classes, but they remain popular.
In the past two years, about 250 students from 20 states and nine foreign
countries have taken some of the classes. About 50 are in the full master's
degree program. Many already have master's or doctoral degrees. The roster
suggests a wide range of interests.
"They're in banking, in pharmaceuticals, in insurance," Larose said.
Some are independent consultants, some in medical jobs, others in the military,
he said.
One of Larose's students, Rafiqul Islam, works for a biotechnology company.
"The best use of data mining, I think, will come in the field of biological
science," said Islam, of West Haven. "Previously, a scientist would
examine one gene, one protein, one drug at a time. Now we have tools that can
analyze thousands of drugs in one experiment."
Bruce Kolodziej of Middletown, N.J., works as an analyst for AT&T and is
taking his third data mining course in Central's online program.
"Between natural science, banking, insurance, pharmaceuticals,
telecommunications - the question is not, 'Where can [data mining] be
applied?'" he said. "Where can't it?"