Thursday, October 5, 2017

What I Find Hard In Doing Data Science Work - Survival Guide (Part 1.1)

Data science is broad, and it turns out that the hardest part of doing data science work is "data mining".

When you scrape and collect data from different sources, you need to determine which one holds the "source of truth", so that the basis of your hypothesis is easy to identify. That is common practice in decision making, but it turns out that it does not always hold in data science.

What I realized while collecting data is that every source is just another point of failure. The bigger your scope, the more sources you have; the more sources you have, the more discrepancies you'll find, and the more cross-checking you need to perform.
For data to be actionable, it needs to be accessible, accurate and standardized.
When seeking the correct values, you need to figure out the "why" once inputs and outputs are shown right next to each other. People rely on human intellect when making judgments, where bias and error can run as high as 90%. In data science, you can't afford to be wrong. On the other hand, you can't afford not to know. That is why the data being collected must be reliable.
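
To make the cross-checking concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the two sample sources, the company names, and the helper names `standardize` and `cross_check`. The assumption is that each source can be reduced to simple key/value records, and that differences in case or whitespace should not count as real discrepancies.

```python
from typing import Dict, List, Tuple


def standardize(value: str) -> str:
    """Normalize a raw value so cosmetic differences (case, spacing) are ignored."""
    return " ".join(value.strip().lower().split())


def cross_check(
    source_a: Dict[str, str], source_b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (key, value_in_a, value_in_b) for every key where the sources disagree.

    A key that is absent from one source is reported as "<missing>" on that side.
    """
    discrepancies = []
    for key in sorted(set(source_a) | set(source_b)):
        a = standardize(source_a[key]) if key in source_a else "<missing>"
        b = standardize(source_b[key]) if key in source_b else "<missing>"
        if a != b:
            discrepancies.append((key, a, b))
    return discrepancies


# Hypothetical records collected from two different sources.
scraped_site = {"acme": "New York ", "globex": "Springfield", "initech": "Austin"}
vendor_feed = {"acme": "new york", "globex": "Shelbyville"}

for key, a, b in cross_check(scraped_site, vendor_feed):
    print(f"{key}: {a!r} vs {b!r}")
```

The sketch only tells you where two sources disagree. Deciding which one is right is still the "why" you have to work out yourself.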

Types of data:
Data is everywhere. However, if we categorize data by its form, it all boils down to two types.




If you thought the one you should be paying attention to is the "data you need", think again...

Some scenarios don't give you the ability to nail down the data you need, so you're left with no option other than to generate it yourself.



Extracting data is easy; generating data is complex.


Personal Experience:
While generating data is very rewarding, the story doesn't end there. The most common problem with data is "sorting", and particularly "parsing". I don't know much about Excel spreadsheets and similar tools. Luckily, my bash skills cover most of what I need.

If you have experience with sed, awk and regex, you should be good.
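
For readers who don't live in the shell, the same parse-then-sort idea can be sketched in Python with the standard `re` module. This is a swap-in for illustration, not the sed/awk pipeline itself. The line layout (`<date> host=<name> bytes=<number>`), the sample lines, and the names `Record` and `parse_lines` are all assumptions.

```python
import re
from typing import List, NamedTuple


class Record(NamedTuple):
    date: str
    host: str
    size: int


# Assumed line layout: "<YYYY-MM-DD> host=<name> bytes=<integer>"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+host=(?P<host>\S+)\s+bytes=(?P<bytes>\d+)$"
)


def parse_lines(lines: List[str]) -> List[Record]:
    """Extract a Record from each line that fits the assumed layout; skip the rest."""
    records = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # unstructured line, ignore it
        records.append(
            Record(match.group("date"), match.group("host"), int(match.group("bytes")))
        )
    return records


# Hypothetical raw input, including one line that does not parse.
raw_lines = [
    "2017-10-05 host=web01 bytes=512",
    "garbage line with no structure",
    "2017-10-04 host=db01 bytes=2048",
    "2017-10-05   host=web02 bytes=128",
]

# Sort the parsed records by size, largest first.
records = sorted(parse_lines(raw_lines), key=lambda r: r.size, reverse=True)
for record in records:
    print(record.date, record.host, record.size)
```

The regular expression does the field extraction and `sorted` does the ordering. Lines that don't match are dropped silently here, which is a choice you may want to change if every line matters.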
Science is limited by data; data is limited by engineering.
All sorts of tools are welcome; however, the main concern in executing the task will always be efficiency. Don't feel bad if you don't know how to do things another way (e.g. parsing data in Excel spreadsheets). Instead, stick to what you know best and what works.