Thursday, January 11, 2018

Handling Data And What To Take Note - Survival Guide (Part 2)



Let me remind everyone that data engineering is not the same as data science, but both are part of "Big Data". This article mainly focuses on data engineering and how to store data so it becomes more useful for analysis.



The process of storing data in a single place is called warehousing, and data warehousing is within the scope of data engineers. Preparing data so it can be served when it's needed is crucial for data-driven companies; thus, making sure nothing is missed or mangled is the top priority in data engineering.

The realm of databases:
Modern applications rely heavily on databases to store information.

Before even starting to build your platform, developers need to evaluate which database they should be using. Since the database plays an important role not only in data storage but also in integration, missing the key elements (optimization and scalability) in the implementation will have a negative impact on the entire operation.

After the selection process, what do you need to be aware of?
  • Normalize data - Unless you are constrained by computing power, always normalize data before storing it in the database. This ensures records are "uniform" as they get warehoused.
  • Treat "numbers" correctly - When you store "numbers" in the database, make sure to classify what they represent. Numbers can be money, coordinates, counts, and so on. Set the data types accordingly.
  • Single format for "dates" - A database will often have multiple tables that handle "dates", and if those tables were not created at the same time, their "date" data types may differ (normally caused by negligence, missing documentation, or wrong documentation). Settle on a single format.
  • Using TEXT over VARCHAR(xxx) - While there's nothing wrong with using VARCHAR, you should be aware of its limitations and usage. Say you have a field named "notes" or "reminder" and you set it to VARCHAR(255). If a user is very explicit in his writing and jots down all the details, but your field only accepts 255 characters -- who is to blame for the lost data? The user who is too keen on the details? The application that only accepts 255 characters? The developer who set the limit on the field?
  • Be careful with "id" - While "id" is human-readable, once it gets stored in the database it causes confusion. Be more specific when you store IDs (use names like internal_app_id or app_id, and avoid naming conventions like id1 or id_1).
  • Follow the basics - Do not name any field with a "reserved keyword". Guidelines are set to be followed, not ignored. This is the best part of learning your database of choice.
  • Support JSON - Modern databases, including relational (SQL) ones, support JSON. This is powerful when used properly. At the same time, if the data is not well presented/structured, it will turn your records into a whole lot of junk.
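Several of the pointers above (explicit numeric types, one date format, descriptive ID names, TEXT for free-form fields) can be sketched in a single schema. This is a minimal example using Python's built-in SQLite driver; the table and column names are made up for illustration, and exact type names differ in MySQL/Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Descriptive ID name (app_id, not a bare "id"), explicit types for numbers,
# a single ISO-8601 text format for dates, and TEXT for free-form notes.
cur.execute("""
    CREATE TABLE invoice (
        app_id        INTEGER PRIMARY KEY,   -- specific, not just "id"
        amount_cents  INTEGER NOT NULL,      -- money as integer cents, not float
        latitude      REAL,                  -- coordinates as floating point
        created_at    TEXT NOT NULL,         -- one format: 'YYYY-MM-DD HH:MM:SS'
        notes         TEXT                   -- no arbitrary 255-character cutoff
    )
""")

long_note = "x" * 1000  # would be truncated or rejected by VARCHAR(255)
cur.execute(
    "INSERT INTO invoice (amount_cents, latitude, created_at, notes) "
    "VALUES (?, ?, ?, ?)",
    (12999, 14.5995, "2018-01-11 09:30:00", long_note),
)

row = cur.execute("SELECT amount_cents, notes FROM invoice").fetchone()
print(len(row[1]))  # the full 1000-character note survives
```

Storing money as integer cents sidesteps floating-point rounding; databases with a native DECIMAL type give you the same safety more directly.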
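On the JSON point, one way to keep a JSON column from turning into junk is to validate and serialize in the application so only well-formed JSON ever reaches the database. A small sketch, again with SQLite and invented table/field names:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (event_id INTEGER PRIMARY KEY, payload TEXT)")

def store_event(conn, payload_dict):
    # json.dumps fails loudly on non-serializable input, so malformed
    # payloads never make it into the column.
    payload = json.dumps(payload_dict)
    conn.execute("INSERT INTO event (payload) VALUES (?)", (payload,))

store_event(conn, {"type": "signup", "source": "web"})
raw = conn.execute("SELECT payload FROM event").fetchone()[0]
print(json.loads(raw)["type"])  # -> signup
```

Databases with native JSON types (e.g. Postgres jsonb) enforce this at the storage layer as well, which is preferable when available.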

The realm of files/documents:
Support legacy systems too! Flexibility is the key to every modern application.

Storing data in a file/doc format is still widespread even in today's web era -- not because of a lack of innovation but because of complexity and overhead. Every piece of data is valuable, so every source (i.e. CSV, Excel, XML) should be supported.

If you plan to support these on your platform, what do you need to be aware of?
  • Increase limits - Most legacy systems keep everything in a single file. That file might be more than 10 GB in size, and your app might time out while acquiring the data from the source "file/doc".
  • Parsing mechanism - It's vital to play it safe when supporting files. Tell your application when to treat a "blank" as NULL and when to treat it as an empty string (""). Tell your application when to remove a space and when to leave it as is.
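Both points can be handled together by streaming the file in chunks (so a multi-gigabyte file never sits in memory) and deciding blank-handling per field. A sketch with Python's csv module; the field names and the blank-means-NULL choice are illustrative assumptions:

```python
import csv
import io

def parse_rows(fileobj, null_blank_fields=("middle_name",), chunk_size=1000):
    """Stream CSV rows in chunks, converting blanks to None only for the
    fields where a blank actually means 'missing'."""
    reader = csv.DictReader(fileobj)
    chunk = []
    for row in reader:
        for field in null_blank_fields:
            # Here blank (or whitespace-only) means NULL; other fields
            # keep their blanks and spaces exactly as received.
            if row.get(field, "").strip() == "":
                row[field] = None
        chunk.append(row)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

sample = io.StringIO("first_name,middle_name,last_name\nAna,,Cruz\nBen, ,Reyes\n")
rows = [r for chunk in parse_rows(sample) for r in chunk]
print(rows[0]["middle_name"])  # -> None
```

Because the generator yields chunk by chunk, each batch can be inserted and discarded before the next is read, which also helps with the timeout problem.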

The realm of data analytics:
While databases are a great place to store data, they are not sufficient to handle the massive queries batched by "analytics tools", especially those that support real-time data streaming.

Setting up a database engine to run data analytics tasks has been made easier by cloud providers. Offerings like Google's Spanner, Amazon's Redshift, and Azure's SQL Data Warehouse are widely used and widely supported by most analytics tool providers.

When using a cloud solution, what do you need to be aware of?

  • Check for data integrity - When syncing data from the database engine (i.e. Postgres/MySQL) to these cloud services, it's important that the data remains as it is -- in terms of structure, size, and format. Using native services like AWS Database Migration Service to migrate RDS data to Redshift is an advantage (rather than doing it manually or via 3rd-party tools).
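One simple way to spot-check integrity after a sync is to compare a row count and an order-independent checksum on both sides. This is a sketch only, demonstrated with two SQLite databases standing in for the source and the migrated copy; the table name is made up, and a real check would also compare schemas:

```python
import hashlib
import sqlite3

def table_fingerprint(conn, table):
    """Return (row_count, checksum) for a table. XOR-combining per-row
    hashes makes the checksum independent of row order."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()  # trusted name
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).hexdigest()
        digest ^= int(h, 16)
    return len(rows), digest

# Simulate a source and a migrated copy with two in-memory databases.
src, dst = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE users (user_id INTEGER, email TEXT)")
    db.executemany("INSERT INTO users VALUES (?, ?)",
                   [(1, "a@example.com"), (2, "b@example.com")])

print(table_fingerprint(src, "users") == table_fingerprint(dst, "users"))
```

If the fingerprints differ, the row counts tell you whether data was dropped outright or silently altered in transit (e.g. by type coercion).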

I've seen many lapses while performing data migrations. I hope these pointers matter to you the next time you set up data storage for your application.