Big Data for The Rest of Us. Affordable and Modern Business Intelligence Architecture – Adding Lifecycles to your S3 buckets to save cost and retain data forever!

I wanted to keep this post short since, as I mentioned in the previous post about cloud storage, our use case is already an affordable one, but it still makes sense to touch on a file movement strategy across the other storage tiers to make sure we are maximizing cost savings against our baseline requirements. In S3, you can easily define lifecycle rules that move your files from standard storage to infrequent access and eventually to cold storage. The different pricing structures can be found in AWS's documentation.

Let's quickly discuss the different storage classes offered by AWS. For more details, see the AWS documentation.

Standard

Amazon S3 Standard offers high durability, availability, and performance object storage for frequently accessed data. Because it delivers low latency and high throughput, S3 Standard is perfect for a wide variety of use cases including cloud applications, dynamic websites, content distribution, mobile and gaming applications, and Big Data analytics.

Infrequent Access

Amazon S3 Standard-Infrequent Access (S3 Standard-IA) is an Amazon S3 storage class for data that is accessed less frequently but requires rapid access when needed. S3 Standard-IA offers the high durability, high throughput, and low latency of S3 Standard, with a low per GB storage price and per GB retrieval fee. This combination of low cost and high performance makes S3 Standard-IA ideal for long-term storage, backups, and as a data store for disaster recovery.

Glacier (Cold Storage)

Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving. You can reliably store any amount of data at costs that are competitive with or cheaper than on-premises solutions. To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides three options for access to archives, from a few minutes to several hours.

In our basic scenario, where we receive files representing a whole year's worth of data, we can consider moving data older than 3 years to infrequent access and data older than 7 years to cold storage. Day-to-day analytics and analysis then runs against S3 Standard, while less frequent, more directional analysis (company strategy work that happens maybe once a quarter) can run against the infrequent access tier. In the rare case that an audit is required, for example during a merger or acquisition, data can be pulled and provided from cold storage to meet that need as well. This lets you maximize cost savings while keeping an effectively infinite data retention policy.

To set up a lifecycle policy on your data, just follow these few simple steps.

  • From your bucket, navigate to Management -> Add Lifecycle Rule
  • Give your Rule a name and a scope. You can use tags if you need the policy to apply to a certain set of files.
  • In the Transition and Expiration sections, define the time period after which data is moved to another tier and, if required, when to expire the file. It is a good idea to tag files on upload so that the same tags can be used in your lifecycle rules.
  • That's it, you now have an automated policy on your data without an ETL or file task process. You can create as many policies as you need, and each can be unique to how you wish to retain your data. A scripted version of the same idea is sketched below.
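If you would rather script this than click through the console, the same rules can be expressed as a lifecycle configuration document and applied with the AWS CLI. The sketch below is a minimal example based on the 3-year and 7-year scenario above, not the exact policy from my bucket: the bucket name dailydatadump is just the sample bucket from the previous post, and the day counts (1095 and 2555) are rough approximations of 3 and 7 years.

{
    "Rules": [
        {
            "ID": "TieredRetention",
            "Filter": { "Prefix": "" },
            "Status": "Enabled",
            "Transitions": [
                { "Days": 1095, "StorageClass": "STANDARD_IA" },
                { "Days": 2555, "StorageClass": "GLACIER" }
            ]
        }
    ]
}

Save that as lifecycle.json and apply it with:

aws s3api put-bucket-lifecycle-configuration --bucket dailydatadump --lifecycle-configuration file://lifecycle.json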

Big Data for The Rest of Us. Affordable and Modern Business Intelligence Architecture – Auto uploading and syncing your data using AWS S3

The first process in any data warehouse project is getting the data into a staging environment. In a traditional data warehouse, this required an ETL process to pick up data files from a local folder or FTP site, or in some cases a direct SQL connection to source systems, and load them into a dedicated staging database. In the new process we will be defining, an S3 bucket becomes our staging environment. Staging data lives forever as raw files, and any analysis or query against it is handled by tools like Athena or Elastic MapReduce, which we will cover later. Keeping this data in S3 (the rough equivalent of Azure Blob Storage in the Microsoft world) is a cheap and convenient way to store massive amounts of data. For example, the first 50 TB is priced at $0.023 per GB per month. In most use cases our data requirements stay under this threshold, so if we assume 1 TB of data files, we can ballpark around $23 a month for storage, or roughly $276 a year. Pretty cheap.

S3 also offers storage classes that get cheaper as your access needs decrease. Data can be moved from S3 Standard, which offers high availability, to S3 Infrequent Access, which is cheaper but slightly slower to read and write. Very old data can then be moved to S3 Glacier, which is the cheapest of the classes but also very slow to access and restore. Lifecycle processes can be defined around your data rules to balance storage cost against access needs; we will walk through that process in a later post. For now, we will work in S3 Standard since the cost is already low.

So, let's get into it. Here are the steps required to automate moving data from a local folder to your first S3 bucket!

  • You will need to create your first S3 bucket in the AWS console. For more details, please see the AWS documentation. We will call this bucket "dailydatadump" (S3 bucket names must be lowercase and globally unique) and use that name in the policy and commands below. No need for anything fancy just yet.
  • The next step is setting up the right access to the bucket via a newly created IAM user. Navigate to IAM in AWS and create a new user. I called mine "AutoUploader", put it in a group called "AutoUploaderGroup", and gave it the "AmazonS3FullAccess" permission.
    1. Be sure to copy the user's ARN as well as the Access Key and Secret Access Key.
  • From here, navigate back to your S3 bucket, click into its details, and click on Bucket Policy. This is where you define the JSON that grants the correct read/write permissions. This step is a bit tricky: the Policy Generator in the IAM tool can create some odd rules and cause errors depending on whether the bucket is empty or already holds a dummy file. Please see my JSON below for a working sample, and replace the IAM user ARN and the bucket ARN with your own. If you get errors saving the rule, remove all items from the Action section except Get and Put, and add them back once you have data in the bucket.
  • JSON Policy for Bucket

{
    "Version": "2012-10-17",
    "Id": "Policy1537137960662",
    "Statement": [
        {
            "Sid": "Stmt1537137433831",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789:user/AutoUploader"
            },
            "Action": [
                "s3:Get*",
                "s3:Put*",
                "s3:List*",
                "s3:ListBucket",
                "s3:ListBucketVersions"
            ],
            "Resource": [
                "arn:aws:s3:::dailydatadump",
                "arn:aws:s3:::dailydatadump/*"
            ]
        }
    ]
}
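A quick aside: once the AWS CLI from the next step is installed and configured, you could also apply this policy from the command line instead of the console. This is just an optional sketch, assuming the JSON above has been saved locally as policy.json and the bucket is the sample dailydatadump bucket:

aws s3api put-bucket-policy --bucket dailydatadump --policy file://policy.json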

  • That should be it for the setup portion on AWS. For the next section you will need to download and install the AWS Command Line Interface (CLI), which you can find with a quick Google search. Once installed, restart your machine.
  • Once restarted, run the command "aws configure".
  1. You will be prompted for the Access Key and Secret Access Key from Step 2.
  2. Enter the region your AWS resources were set up in. S3 bucket names are global, but the CLI still expects a default region. Mine was set to us-east-1.
  3. Set the output format to "text".
  • For this example I created a local folder on my C drive and dropped in 3 big data files surrounding the World Happiness Index, which are free to download from https://www.kaggle.com/
  • From here all you need to know is the following two commands.
    1. aws s3 sync . s3://dailydatadump to sync files from your local folder to your bucket
    2. aws s3 sync s3://dailydatadump . to sync files from your S3 bucket back to your local folder
    3. The sync command works a lot like the XCOPY command at a regular command prompt: files that already exist and are unchanged are skipped, saving you processing time.
  • That's it! With a few small tweaks this can be saved as a BAT file and scheduled to run every few minutes to keep data files flowing to your S3 bucket for all your future data analysis needs. A minimal sketch of such a BAT file follows.
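For reference, here is a rough sketch of what that scheduled BAT file might look like. The folder path C:\DataDump and the bucket name dailydatadump are just the sample values from this walkthrough; swap in your own.

@echo off
REM Assumes "aws configure" has already been run for this Windows user.
REM Change to the local staging folder and push any new or changed files to S3.
cd /d C:\DataDump
aws s3 sync . s3://dailydatadump

Point Windows Task Scheduler at the BAT file to run it every few minutes; since sync skips unchanged files, repeated runs cost you almost nothing.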

Big Data for The Rest of Us. Affordable and Modern Business Intelligence Architecture – An Introduction using AWS

If you Google the use cases for Big Data, you will usually find references to scenarios such as web click analytics, streaming data, or even IoT sensor data, but most organizations' data needs and data sources never fall into any of these categories. That does not mean they are not great candidates for a modern Big Data BI solution.

What is never mentioned in those use cases is the cost, which can get astronomical. Most businesses do not require that level of horsepower but can still leverage the new technologies to create a data lake and data warehouse for a fraction of the cost. Done right, the result is a solution that is more scalable and cheaper than a traditional server-based data warehouse. This lets organizations future-proof their data needs: they may have "medium" data right now but expect to grow into the big data space later.

In the next series of blog posts, we're going to walk through the basic building blocks of a modern Business Intelligence solution in AWS while keeping an eye on cost and resourcing. A typical production server VM with 1 TB of storage and 8 GB of RAM, including database licensing, runs around $20K a year, so we will set that as a baseline and see if we can build a solution that comes in close to or cheaper than it. Check out the first post in the series!
