Managing & Securing Data for the World's Families
It's not easy being the CTO of a company that has a 10 petabyte database with 13 billion structured and unstructured records going back to the 1300s—a number that grew by 1.2 billion documents in 2013. Then add a paying subscriber base of 2.7 million people around the world who generate an average of 75 million searches a day on the company's various Websites, including Ancestry.com, MyFamily.com, FamilyTreeMaker.com and Genealogy.com.
That's the challenging job that Scott Sorensen took on last April when he became the CTO of Ancestry.com, purportedly the world's largest online resource of family histories.
In its quest to continue expanding and enhancing its enormous database of family information, the company launched AncestryDNA in May 2012. The database currently has DNA from more than 300,000 people, who get information on which of the 26 regions of the world their ancestors came from. To determine someone's ethnicity, AncestryDNA has to analyze 700,000 markers—clearly a big data analytics initiative.
The system also can match an individual with his or her fourth cousins with an average accuracy of 95 percent. To deliver these DNA-based cousin matches, Sorensen's team rewrote the Germline DNA matching algorithm using Hadoop and HBase, and called their scalable implementation “Jermline.”
Before Ancestry.com switched to Hadoop and Jermline, it took 24 hours to process matches from a pool of 60,000 DNA samples. However, on the way to 120,000 samples, each additional set of 1,000 samples would have required 700 hours (more than four weeks) to process.
"We knew we had to change that," Sorensen recalls. "As we got more subscribers and records, we had to keep adding servers, and that was expensive and hard to scale.
"So, in the summer of 2012, we switched to a Hadoop cluster, which lets us scale up by adding inexpensive commodity hardware. Then we parallelized the processing, and cut the processing time down from four weeks to less than a day."
Both historical records and customer behavior data are stored in the Hadoop system. Machine learning running on Hadoop is used to build algorithms that link historical records with subscribers' family trees.
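The parallelized processing Sorensen describes can be pictured with a small sketch. This is not Jermline, which the article says runs on Hadoop and HBase; it is a single-machine illustration in Python, and the names `shared_fraction`, `match_chunk` and `find_matches`, the toy similarity score and the sample strings are all hypothetical. It assumes every sample is compared with every other one, splits those comparisons into chunks, and hands each chunk to a separate worker.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def shared_fraction(a: str, b: str) -> float:
    """Toy similarity (an assumption): share of positions where two marker strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def match_chunk(pairs, samples, threshold):
    """Score one chunk of sample pairs and keep those at or above the threshold."""
    return [(i, j) for i, j in pairs
            if shared_fraction(samples[i], samples[j]) >= threshold]


def find_matches(samples, threshold=0.75, workers=4):
    """Split all sample pairs into chunks and score the chunks concurrently."""
    pairs = list(combinations(sorted(samples), 2))
    chunks = [pairs[k::workers] for k in range(workers)]  # round-robin split
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(match_chunk, chunks,
                           [samples] * workers, [threshold] * workers)
    return sorted(m for chunk in results for m in chunk)


# Hypothetical marker strings, far shorter than real genotype data.
samples = {"s1": "AACCGGTT", "s2": "AACCGGTA", "s3": "TTGGCCAA"}
print(find_matches(samples))  # [('s1', 's2')]
```

Threads are used here only to keep the example short and portable; a cluster spreads the same kind of chunks across many machines. If every sample really were compared with every other, a pool of 60,000 samples would mean roughly 1.8 billion pairs, which hints at why splitting the work matters.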
To manage this growing business, Ancestry.com has 1,400 employees, including more than 400 who work for Sorensen as engineers, application developers, data scientists, bioinformatics experts, DNA scientists, Hadoop specialists and product managers.
Finding professionals with the right skill sets has been a challenge. "We hired a data scientist—a specialty that's hard to find—and found existing staff who had statistics and programming backgrounds and trained them in machine learning and predictive analytics," Sorensen says.
Another challenge facing the company involves security and privacy issues. With so much personal information in its database, Ancestry.com is rigorous about security.
"All PII [personally identifiable information] is kept separate from our DNA data," Sorensen explains. "Then we use encrypted tokens to link personal information with DNA data.
"We also keep up with changing rules and regulations in different parts of the world. For instance, we comply with the European Union's privacy rules, as well as Canada's regulations."
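The idea of keeping PII apart from DNA data and linking the two through tokens can be sketched in a few lines of Python. The article does not say how Ancestry.com builds its tokens, so this sketch makes its own assumptions: it swaps in a keyed hash (HMAC-SHA256) rather than encryption, uses two plain dictionaries in place of separate databases, and the names `LINK_KEY`, `link_token`, `register` and `dna_for` are hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key; a real system would keep it in a key-management service.
LINK_KEY = secrets.token_bytes(32)


def link_token(account_id: str, key: bytes = LINK_KEY) -> str:
    """Derive an opaque token from an account ID with a keyed hash (an assumption)."""
    return hmac.new(key, account_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Two separate stores; dictionaries stand in for separate databases.
pii_store = {}  # account_id -> personal details
dna_store = {}  # token -> genetic data, with no names or email addresses


def register(account_id, name, email, markers):
    """Write personal details and DNA data to different stores."""
    pii_store[account_id] = {"name": name, "email": email}
    dna_store[link_token(account_id)] = {"markers": markers}


def dna_for(account_id):
    """Look up a DNA record; this only works for a caller that holds the key."""
    return dna_store.get(link_token(account_id))


register("acct-1", "Ada Example", "ada@example.com", ["AG", "CT"])
print(dna_for("acct-1"))  # {'markers': ['AG', 'CT']}
```

In this arrangement, someone who obtained only the DNA store would see tokens and markers but nothing that names a person, and recomputing a token requires the key.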
By implementing innovative technology and hiring skilled staff, Sorensen and his team support and enhance Ancestry.com's mission of helping millions of people find information about their family's history.