Migrating a Big Data Warehouse to the Cloud
By Adam Haines
In today’s world, big data has become the norm. But migrating large data sets from on-premises systems to the cloud is cumbersome and fraught with challenges.
At Federated Sample, our business focuses on ensuring that the market research industry secures quality sample respondents for its research. The studies often have complex quotas and highly defined targeting requirements, which can make finding the right sample an involved process.
Initially, we stored our data in on-premises Microsoft SQL Server systems. However, as our company grew and requests for information increased, running queries against these transactional systems created a bottleneck. To overcome this challenge, we looked for an end-to-end data warehousing solution that would enable us to increase efficiency and more effectively meet the real-time reporting and business intelligence needs of our customers.
Since Amazon Redshift did not require a significant capital investment and we were already an Amazon Web Services (AWS) shop, that was the logical data warehouse choice for us. But it was only part of the equation; we still needed to migrate data from our Microsoft SQL Server systems to the cloud-based Redshift data warehouse.
As we made our transition to the cloud, the following lessons became clear:
Before you start migrating to the cloud, investigate third-party solutions.
As a cost-effective measure, our team built a homegrown solution around SQL Server Integration Services (SSIS), in which data traveled through multiple staging steps to be encoded and reshaped for direct copying into Redshift. Unfortunately, the solution was labor-intensive and rife with problems: the file-encoding process was error-prone, only small files could be sent, and there was no way to identify or track errors. Our initial upload took more than a week, and subsequent updates took hours.
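For context, the homegrown approach boiled down to something like the following sketch: split a table export into gzip-compressed CSV chunks small enough to upload individually, then issue a Redshift COPY against the uploaded prefix. This is an illustrative reconstruction, not our actual SSIS pipeline; the table name, S3 prefix, and IAM role below are hypothetical.

```python
import csv
import gzip
import io

def split_and_compress(rows, header, chunk_size):
    """Split exported rows into gzip-compressed CSV chunks small
    enough to upload one at a time (our homegrown tooling choked
    on large files)."""
    chunks = []
    for start in range(0, len(rows), chunk_size):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)
        writer.writerows(rows[start:start + chunk_size])
        chunks.append(gzip.compress(buf.getvalue().encode("utf-8")))
    return chunks

def copy_command(table, s3_prefix, iam_role):
    """Build a Redshift COPY statement that loads every chunk
    under an S3 prefix in parallel."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' CSV GZIP IGNOREHEADER 1;"
    )

# Example: 25 exported rows split into chunks of 10.
header = ["respondent_id", "survey_id", "answered_at"]
rows = [[i, i % 3, f"2014-01-{i % 28 + 1:02d}"] for i in range(25)]
chunks = split_and_compress(rows, header, chunk_size=10)
print(len(chunks))  # 3 chunks: 10 + 10 + 5 rows
```

Every stage here is a place for something to go wrong silently, which is exactly why the approach did not scale for us.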
After a month of this complex and time-consuming trial, we knew that we needed a more scalable and sound approach. What we didn’t realize initially was that we had access to the AWS partner ecosystem. For example, we used Attunity CloudBeam to quickly and easily transfer data from our SQL Server systems to our Redshift data warehouse. Our entire data set loaded in hours rather than days, and it’s no longer necessary to babysit the data-transfer process.
Attunity CloudBeam and Jaspersoft, a solution we use for business intelligence, are AWS partners that played key roles in our onboarding to Redshift. Today, with the system up and running, our data is automatically synced, providing real-time information feeds that have minimal impact on our production environment. More importantly, our account managers can run queries in Amazon Redshift and get customers the information they need quickly in order to close more business.
Automate data migration and syncing.
I can’t stress this enough: There is no need to manually migrate your data. Unless you’re working with very small data sets (under 300GB) and small transaction loads, you will regret not automating.
In our case, with our homegrown solution, I was essentially working around the clock to oversee data migration and synchronization. Based on my salary, the cost of our third-party solution paid for itself in less than a month. It was well worth the investment, since I can now focus on higher-value projects.
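The automated tools handle change tracking for you, but the underlying idea is a high-water-mark incremental sync: copy only rows newer than the last synced position, then advance the mark. Here is a minimal sketch against in-memory stand-ins for the source and target (all names are hypothetical, and real change data capture is considerably more involved):

```python
def sync_increment(source_rows, target_rows, last_synced_id):
    """Copy only rows newer than the high-water mark, then advance
    the mark -- the core idea behind automated incremental sync.
    Each row is (id, payload); ids increase monotonically."""
    new_rows = [r for r in source_rows if r[0] > last_synced_id]
    target_rows.extend(new_rows)
    return last_synced_id if not new_rows else new_rows[-1][0]

source = [(1, "a"), (2, "b"), (3, "c")]
target = []
mark = sync_increment(source, target, last_synced_id=0)    # copies all 3
source += [(4, "d"), (5, "e")]
mark = sync_increment(source, target, last_synced_id=mark) # copies 2 more
print(mark, len(target))  # 5 5
```

Run on a schedule, a loop like this never re-copies data it has already moved, which is what makes automated syncing cheap enough to run continuously.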
If possible, do a proof-of-concept (POC) with your cloud provider and other third-party solution providers.
We did a POC with Attunity to be certain that we could load our full database in a reasonable amount of time. That gave our team the confidence to move forward, and it helped us secure the budget for the project.
Our move to Amazon Redshift allowed us to thin out our production server. Our overall data store grows by about 10GB a day, and our goal is to keep the SQL Server system at about 500GB. Everything else is now stored in Redshift. This reduction in on-premises storage has helped us save thousands of dollars per year, while greatly improving the performance of our production server.
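As a rough back-of-the-envelope check on that retention target (the figures are the ones above; the calculation itself is my own illustration):

```python
def retention_days(cap_gb, growth_gb_per_day):
    """How many days of data fit under the on-premises cap before
    older rows must be offloaded to the warehouse."""
    return cap_gb // growth_gb_per_day

# At ~10GB/day of growth, a 500GB cap holds roughly 50 days of data.
print(retention_days(cap_gb=500, growth_gb_per_day=10))  # 50
```

In other words, the production server holds only the most recent few weeks of hot data; everything older lives in Redshift.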
Don’t assume default settings on third-party solutions will work well for your environment. Chances are they won’t.
Vendor products typically have default settings that cater to a broad spectrum of environments. These settings may or may not be optimal for your environment. It is essential to work with the vendor to learn how to correctly tune the product so the tool fits your specific environment.
Attunity CloudBeam has a good set of default options in its change data capture (CDC) feature. However, in our case, the default settings hindered our ability to replicate data as quickly as we wanted. So we worked with Attunity to adjust the CDC batch-tuning default value to reduce replication latency to the desired level.
The best advice I can give you is to test, test and then test some more. You’ll be glad later that you took the time to do it.
Now that we’re up and running with Amazon Redshift, I can vouch for the numerous benefits it offers. We are extremely happy with our implementation, and I’m finally taking that long-awaited vacation.
Adam Haines, director of systems engineering at Federated Sample, has more than 10 years of industry experience. He is a Microsoft Certified Master (MCM), a Microsoft Certified Solutions Master (MCSM) and a former Microsoft Most Valuable Professional (MVP).