Friday, February 9, 2018

Data Migration 101 (An Introduction to Talend) - Survival Guide (Part 2.1)

Data migration is risky and tedious, especially when you're not using a tool that helps you build a pipeline for the migration process.

Data migration is all about tooling and pipelining. So if you're asked to do a data migration, your first question will be: "Which tool best fits my needs?"


The ideal tool for data migration should tick the following boxes:
  • Support for multiple sources and types (SQL, NoSQL, files, CRM, API, third-party services)
  • A parsing mechanism
  • Market standard (a healthy/active community)

After doing our homework and research, we selected Talend as our ETL tool for data migration.

In the time we've spent with the tool, we've managed to get things rolling and performed the data migration successfully. Along the way, we encountered small issues we thought might confuse other users, so we've jotted them down.

This should give you an overview (a starter kit) of how things work "the Talend way".


PRELUDE:
  • I strongly advise you to be on Windows or OSX; there are unexplainable errors in the Linux build of the tool.
  • Do not use a Windows virtual machine; file corruption and inconsistent save versioning make this kind of setup unreliable.
  • On OSX, there are compatibility problems between JDK versions and Talend versions:
    • On OSX High Sierra, use the JDK 1.8u151 release.
    • On OSX High Sierra, if the tool hangs on the license loading screen, try launching it via the terminal (example below).
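
If you do go the terminal route, run the binary inside the app bundle directly so startup errors show up in the console. The bundle name below is an assumption based on a default Talend Open Studio install; adjust it to match your version:

/Applications/TOS_DI-macosx-cocoa.app/Contents/MacOS/TOS_DI-macosx-cocoa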



NOTE #1:
Postgres databases on Heroku have SSL enabled, which Talend doesn't support at the moment. To work around the issue, here's the additional snippet you'll append to your connection string:

?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory&tcpKeepAlive=true
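
To sanity-check the workaround outside Talend, here's a minimal JDBC sketch; the host, database name, and credentials are placeholders, not values from our setup:

import java.sql.Connection;
import java.sql.DriverManager;

public class HerokuPgSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder Heroku host/database/credentials -- substitute your own.
        String url = "jdbc:postgresql://example-host.compute.amazonaws.com:5432/example_db"
                + "?ssl=true"
                + "&sslfactory=org.postgresql.ssl.NonValidatingFactory"
                + "&tcpKeepAlive=true";
        try (Connection conn = DriverManager.getConnection(url, "example_user", "example_pass")) {
            // Reaching this line means the SSL workaround took effect.
            System.out.println("Connected to " + conn.getCatalog());
        }
    }
}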



NOTE #2:
Once the connection is established and the schema is retrieved, Talend treats the whole connection string (SSL parameters and all) as the database name, so it parses the schema as:

SELECT "database_name?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory&tcpKeepAlive=true"."database_table"."database_column" FROM "database_name?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory&tcpKeepAlive=true"."database_table"


NOTE #3:
If you then use any of these tables in a Standard Job (i.e. PostgreSQL (table) > Talend (tMap) > target host (table)), running it generates this error message:

ERROR: cross-database references are not implemented

Googling doesn't help much here, even with the right keywords; most results cover the general PostgreSQL limitation or Talend "general knowledge" pages. The workaround is to alter the "Query" that Talend sets automatically when it retrieves the PostgreSQL schema.

It's best to reduce the "Query" to its base form:

SELECT * FROM database_table
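
For context, the "Query" sits in the component's Basic settings as a Java string literal. A sketch of the fix, with the generated value abbreviated and identifiers mirroring the placeholders above:

// Generated by Talend -- broken, because the SSL suffix leaks into the identifiers:
"SELECT \"database_name?ssl=true&...\".\"database_table\".\"database_column\" FROM \"database_name?ssl=true&...\".\"database_table\""
// Edited by hand -- the base form that runs cleanly:
"SELECT * FROM database_table"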




NOTE #4:
When mapping database fields from source to target host, pay attention to fields whose names contain "isDeleted": reading them generates an error. You can simply remove them on the Talend Job Design board so they won't be read.




NOTE #5:
Sometimes you'll bump into errors even when your mapping is right. If you have 10 fields from the source and 30 fields on the target host, and you map the 10 fields across, remove the 20 unused fields from the tMap output (see the sketch below).
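
Conceptually, trimming the unused columns means the output component only generates an INSERT for the fields you actually mapped; a sketch with hypothetical column names:

INSERT INTO target_table (id, name, email) VALUES (?, ?, ?)

rather than an INSERT listing all 30 target columns, with nulls for the 20 unmapped fields, which is presumably what triggers the errors.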


BONUS:
Doing data migration is very time-consuming, especially when you have a slow internet connection. It's best to run your ETL tool on Amazon WorkSpaces or something like Paperspace.

We hope you'll be able to perform your data migration task using this powerful tool!
Maybe next time we can share more in-depth technical details about the data migration we've performed.

Happy administration!