Sinking A Mine Shaft Into Data
Procedure Reveals Hidden Patterns
December 26, 2003
By ROBERT A. FRAHM, Courant Staff Writer
NEW BRITAIN -- As businesses, government, universities and others churn out a relentless and growing flood of information, you never know what gem might be hidden in the data.
A promising sales plan, perhaps? A winning NBA game strategy? A terrorist plot?
Whatever it is, Daniel T. Larose can help you find it.
Larose, a statistics professor at Central Connecticut State University, is training a new generation of specialists to tap into vast amounts of data using high-powered computers and a promising technique known as data mining.
Larose's popular Internet-based program - the world's only online master's degree program in data mining, according to Central - attracts students from all over the United States and several foreign countries. Larose introduces them to statistical models that can extract information from databases too massive for conventional analysis.
Data mining - the use of technology to recognize obscure or hidden patterns in giant masses of data - has been in the news lately, mostly as a tool to track terrorists. But Larose says the technique could benefit everyone from booksellers studying readers' buying habits to pharmaceutical companies looking for a cure for cancer.
"It's going to be everywhere," said Larose.
"We are ... inundated with data in most fields," he writes in a draft of a book about data mining. "The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge."
The field has intrigued scientists and statisticians trying to make sense of the information explosion fueled by the Internet and increasingly powerful technology over the past decade.
"Scientific satellites send terabytes of data every day. Some send terabytes of data every minute," said Gregory Piatetsky-Shapiro, editor of KDnuggets.com and KDnuggets News, a Boston-based website and newsletter about data mining. A terabyte, he says, is roughly equivalent to the data in a million Yellow Book phone directories.
"There is all this information - how to make sense of it?" he said.
At Central, Larose began teaching the online courses two years ago, the same year that Technology Review, a magazine affiliated with the Massachusetts Institute of Technology, listed data mining as one of 10 new technologies that will change the world. The emerging field gained new attention after the Sept. 11 attacks because of its potential for spotting new terrorist plots.
Using data mining software, investigators could, for example, sort through debit and credit card records, flight school rosters, airline ticket bookings, immigration records and personal data such as age and ethnicity to spot obscure patterns leading to potential terrorists or suspicious activities.
Critics fear that such powerful snooping could lead to abuses and violations of privacy rights. Earlier this year, Congress canceled a Pentagon anti-terror project called Total Information Awareness out of concerns that the data mining project might be used to spy on ordinary citizens.
"It's not that data mining is in itself bad," Larose says, "but it can be put to pernicious uses, just like any other tool."
A more benign use of data mining, Larose says, is in professional sports.
At every NBA game, for example, statisticians track shots, rebounds, misses, assists and fouls, noting such things as the location of players on the floor, player substitutions and the time clock.
"Basketball definitely lends itself to data mining because there is so much data," said Jay Wessel, senior director of technology for the Boston Celtics.
The Celtics use software to analyze trends in statistics gathered at thousands of NBA games to develop game strategies, such as how to defend a specific player or what combination of players to use against a particular opposing lineup, he said.
"Some of the same code developed for banking we use to look for data," he said.
In Larose's online classes, you can learn about "neural networks" and "classification and regression trees" - some of the techniques used to analyze huge sets of data.
At $425 per credit, data mining courses are Central's most expensive online classes, but they remain popular.
In the past two years, about 250 students from 20 states and nine foreign countries have taken some of the classes. About 50 are in the full master's degree program. Many already have master's or doctoral degrees. The roster suggests a wide range of interests.
"They're in banking, in pharmaceuticals, in insurance," Larose said. Some are independent consultants, some in medical jobs, others in the military, he said.
One of Larose's students, Rafiqul Islam, works for a biotechnology company. "The best use of data mining, I think, will come in the field of biological science," said Islam, of West Haven. "Previously, a scientist would examine one gene, one protein, one drug at a time. Now we have tools that can analyze thousands of drugs in one experiment."
Bruce Kolodziej of Middletown, N.J., works as an analyst for AT&T and is taking his third data mining course in Central's online program.
"Between natural science, banking, insurance, pharmaceuticals, telecommunications - the question is not, `Where can [data mining] be applied?'" he said. "Where can't it?"