Learn how to level up your Data Pipelines!



By Nicholas Leong, Data Engineer, Writer



Image by Author



Everyone: What do Data Engineers do?
Me: We build pipelines.
Everyone: You mean like a plumber?



Something like that, but instead of water flowing through pipes, data flows through our pipelines.

Data Scientists build models and Data Analysts communicate data to stakeholders. So, what do we need Data Engineers for?

Little do they know, without Data Engineers, models won’t even exist. There won’t be any data to be communicated. Data Engineers build warehouses and pipelines to allow data to flow through the organization. We connect the dots.

 
Should You Become a Data Engineer in 2021?
 

Data Engineer was the fastest-growing job in 2019, growing by 50% YoY, which is higher than the job growth of Data Scientist at 32% YoY.

Hence, I’m here to shed some light on some of the day-to-day tasks a Data Engineer gets. Building data pipelines is just one of them.

 

ETL/ELT Pipelines

 
 

ETL — Extract, Transform, Load
ELT — Extract, Load, Transform


What do these mean and how are they different from each other?

In the data pipeline world, there is a source and a destination. In the simplest form, the source is where Data Engineers get the data from and the destination is where they want the data to be loaded into.

More often than not, there will need to be some processing of data somewhere in between. This can be due to numerous reasons, which include but are not limited to:

  • The difference in types of Data Storage
  • Purpose of data
  • Data governance/quality

Data Engineers label the processing of data as transformations. This is where they perform their magic to transform all kinds of data into the form they intend it to be.

In ETL Data Pipelines, Data Engineers perform transformations before loading data into the destination. If there are relational transformations between tables, these happen within the source itself. In my case, the source was a Postgres database. Hence, we performed relational joins in the source to obtain the data required, then loaded it into the destination.

In ELT Data Pipelines, Data Engineers load data into the destination raw. They then perform any relational transformations within the destination itself.
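To make the difference concrete, here is a minimal sketch of both styles. It uses two in-memory SQLite databases as stand-ins for the source and the destination (an assumption for illustration, not our actual Postgres or BigQuery setup), and the table contents are made up.

```python
import sqlite3

def make_source() -> sqlite3.Connection:
    """Build a tiny in-memory 'source' with two related tables (sample data)."""
    src = sqlite3.connect(":memory:")
    src.executescript(
        """
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE transactions (user_id INTEGER, revenue REAL);
        INSERT INTO users VALUES (1, 'MY'), (2, 'SG');
        INSERT INTO transactions VALUES (1, 10.0), (2, 5.5), (3, 7.0);
        """
    )
    return src

JOIN_SQL = """
    SELECT a.user_id, b.country, a.revenue
    FROM transactions a
    LEFT JOIN users b ON a.user_id = b.user_id
    ORDER BY a.user_id
"""

def run_etl(src, dst):
    """ETL: join inside the source, then load only the finished result."""
    rows = src.execute(JOIN_SQL).fetchall()
    dst.execute("CREATE TABLE report (user_id INTEGER, country TEXT, revenue REAL)")
    dst.executemany("INSERT INTO report VALUES (?, ?, ?)", rows)
    return dst.execute("SELECT * FROM report ORDER BY user_id").fetchall()

def run_elt(src, dst):
    """ELT: copy the raw tables as they are, then join inside the destination."""
    dst.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
    dst.execute("CREATE TABLE transactions (user_id INTEGER, revenue REAL)")
    dst.executemany(
        "INSERT INTO users VALUES (?, ?)",
        src.execute("SELECT * FROM users").fetchall(),
    )
    dst.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        src.execute("SELECT * FROM transactions").fetchall(),
    )
    return dst.execute(JOIN_SQL).fetchall()

etl_rows = run_etl(make_source(), sqlite3.connect(":memory:"))
elt_rows = run_elt(make_source(), sqlite3.connect(":memory:"))
```

Both functions end up with the same rows. The only thing that changes is which database pays for the join.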

In this article, we will talk about how I transformed more than 100 ETL pipelines in my organization into ELT pipelines, and go through the reasons I did it.

 

How I Did It

 
 
Initially, the pipelines were run using Linux cron jobs. Cron jobs are like your traditional task schedulers, set up from the Linux terminal. They are the most basic way of scheduling programs, without functionality such as:

  • Setting dependencies
  • Setting Dynamic Variables
  • Building Connections
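As a rough illustration of what "setting dependencies" buys you, the sketch below derives a safe run order from a dependency map using Python's standard `graphlib` module (Python 3.9+). The job names are hypothetical, and this is not how cron or Airflow work internally. It only shows the ordering guarantee that a bare cron schedule cannot express.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it must wait for.
dependencies = {
    "extract_users": set(),
    "extract_transactions": set(),
    "build_revenue_report": {"extract_users", "extract_transactions"},
}

def execution_order(deps):
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
```

With cron, you would have to guess start times far enough apart and hope the upstream jobs finish in time.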



Image by Author

 

This was the first thing to go as it was causing way too many issues. We needed to scale. To do that, we had to set up a proper Workflow Management System.

We chose Apache Airflow. I wrote all about it here.

 
Data Engineering — Basics of Apache Airflow — Build Your First Pipeline
 

Airflow was originally built at Airbnb and later made open source. Popular companies like Twitter also use it as their pipeline management system. You can read all about the benefits of Airflow in the article linked above.

After that was sorted out, we had to change the way we were extracting data. The team suggested redesigning our ETL pipelines into ELT pipelines. More on why we did it later.



Image by Author

 

Here’s an example of the pipeline before it was redesigned. The source we were dealing with was a Postgres Database. Hence, to obtain data in the form intended, we had to perform joins in the source database.

SELECT
    a.user_id,
    b.country,
    a.revenue
FROM transactions a
LEFT JOIN users b
    ON a.user_id = b.user_id

 

This is the query run in the source database. Of course, I’ve simplified the example to its dumbest form. The actual queries were over 400 lines of SQL.

The query results were saved in a CSV file and then uploaded to the destination, which is a Google BigQuery database in our case. Here’s what it looked like in Apache Airflow:

This is a simple example of an ETL pipeline. It was working as intended, but the team had realized the benefits of redesigning this into an ELT pipeline. More on that later.



Image by Author

 

Here’s an example of the pipeline after it was redesigned. Observe how the tables are brought into the destination as they are. After all the tables have been successfully extracted, we perform relational transformations in the destination.

-- transactions
SELECT *
FROM transactions

-- users
SELECT *
FROM users

 

These are the queries run in the source database. Most of the extractions use ‘Select *’ statements without any joins. For appending jobs, we include where conditions to pull only the rows that have not been loaded yet.
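A where condition for an appending job might look like the sketch below. The `created_at` watermark column and the `WATERMARK_COLUMNS` mapping are hypothetical names chosen for illustration, and SQLite stands in for the source database.

```python
import sqlite3

# Hypothetical mapping of table name -> column used to detect new rows.
WATERMARK_COLUMNS = {"transactions": "created_at"}

def build_extract_query(table, last_loaded=None):
    """Return (sql, params) for a full or incremental extract of `table`."""
    if table not in WATERMARK_COLUMNS:
        # Table names cannot be bound as parameters, so only known names pass.
        raise ValueError(f"unknown table: {table}")
    if last_loaded is None:
        # Full refresh: pull the table as it is.
        return f"SELECT * FROM {table}", ()
    # Appending job: only rows newer than the last successful load.
    column = WATERMARK_COLUMNS[table]
    return f"SELECT * FROM {table} WHERE {column} > ?", (last_loaded,)

source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE transactions (user_id INTEGER, revenue REAL, created_at TEXT)"
)
source.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 10.0, "2021-01-01"), (2, 5.5, "2021-01-02"), (1, 7.0, "2021-01-03")],
)

sql, params = build_extract_query("transactions", last_loaded="2021-01-01")
new_rows = source.execute(sql, params).fetchall()
```

Here only the two rows dated after the last load are extracted, which keeps the query on the source cheap.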

Similarly, the query results were saved in a CSV file and then uploaded into the Google BigQuery database. We then made a separate DAG for transformation jobs by setting dependencies within Apache Airflow. This ensures that all the extraction jobs have completed before the transformation jobs run.
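The save-to-CSV-then-upload step can be sketched roughly as follows. This is not our production code. SQLite plays both the source and the destination, and the helper names `export_to_csv` and `load_csv` are made up for the example.

```python
import csv
import os
import sqlite3
import tempfile

def export_to_csv(conn, query, path):
    """Run `query` and write the header plus all rows to a CSV file."""
    cursor = conn.execute(query)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor)

def load_csv(conn, table, path):
    """Insert every CSV row into an existing table with matching columns."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        columns = next(reader)  # first line is the header
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
            reader,
        )

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
source.executemany("INSERT INTO users VALUES (?, ?)", [(1, "MY"), (2, "SG")])

destination = sqlite3.connect(":memory:")
destination.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")

with tempfile.TemporaryDirectory() as workdir:
    csv_path = os.path.join(workdir, "users.csv")
    export_to_csv(source, "SELECT * FROM users", csv_path)
    load_csv(destination, "users", csv_path)

loaded = destination.execute("SELECT * FROM users ORDER BY user_id").fetchall()
```

CSV carries no type information, so the destination table's column types decide how each value is stored. That is worth keeping in mind when the real destination is a warehouse with a strict schema.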

We set dependencies using Airflow Sensors. You can read about them here.
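Conceptually, a sensor keeps checking a condition until it is met or a timeout passes. The plain-Python sketch below illustrates that idea only. It is not Airflow's sensor API, and `wait_for` and the job names are hypothetical.

```python
import time
from typing import Callable

def wait_for(condition: Callable[[], bool], poke_interval: float, timeout: float) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)

required_jobs = {"extract_users", "extract_transactions"}
finished_jobs = {"extract_users"}  # one extraction is still running

def extractions_done() -> bool:
    """True once every required extraction job has finished."""
    return required_jobs <= finished_jobs

# Times out, because one extraction has not finished yet.
first_attempt = wait_for(extractions_done, poke_interval=0.01, timeout=0.05)

# Once the last extraction reports in, the transformation may start.
finished_jobs.add("extract_transactions")
ready = wait_for(extractions_done, poke_interval=0.01, timeout=0.05)
```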

 
Data Engineering — How to Set Dependencies Between Data Pipelines in Apache Airflow
 

Why I Did it

 
 



Photo by Markus Winkler on Unsplash

 

Now that you understand how I did it, we move on to the why: why exactly did we rewrite all our ETL pipelines into ELT pipelines?

 

Cost

 
 
Running our old pipelines cost our team resources, specifically time, effort, and money.

To understand the cost aspect of things, you have to understand that our source database (Postgres) was an ancient machine set up back in 2008. It was hosted on-prem. It was also running an old version of Postgres, which made things even more complicated.

It wasn’t until recent years that the organization realized the need for a centralized data warehouse for Data Scientists and Analysts. That is when they started to build the old pipelines on cron jobs. As the number of jobs increased, they drained the machine’s resources.

The SQL joins written by the previous Data Analysts were also all over the place. There were over 20 joins in a single query in some pipelines, and we were approaching 100+ pipelines. Our tasks began running at midnight and usually finished around 1–2 p.m., a run of 12+ hours, which is absolutely unacceptable.

For those of you who don’t know, SQL joins are among the most resource-intensive operations to run. A query’s runtime can grow steeply as the number of joins increases.



Image by Author

 

Since we were moving onto Google Cloud, the team understood that Google BigQuery is lightning fast at computing SQL queries. You can read all about it here.

 
How fast is BigQuery? | Google Cloud Blog
 

Hence, the whole point is to only run simple ‘Select *’ statements in the source and perform all the joins on Google Cloud.

This more than doubled the efficiency and speed of our Data Pipelines.

 

Scalability

 
 



Photo by Quinten de Graaf on Unsplash

 

As businesses scale, so do their tools and technologies.

By moving onto Google Cloud, we can easily scale our machines and pipelines without worrying much.

Google Cloud offers Cloud Monitoring, a tool that collects metrics, events, and metadata from your Google Cloud technologies like Google Cloud Composer, Dataflow, BigQuery, and many more. You can monitor all sorts of data points, which include but are not limited to:

  • Cost of Virtual Machines
  • The cost of each query run in Google BigQuery
  • The size of each query run in Google BigQuery
  • Duration of Data Pipelines

This made monitoring a breeze for us. By performing all transformations on Google BigQuery, we are able to accurately monitor our query size, duration, and cost as we scale.
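To show how query size translates into cost, here is a back-of-the-envelope sketch. It assumes a billing model that charges per TiB of data scanned, and the rate passed in is a placeholder rather than a quoted price, so check the current pricing before relying on any number it produces.

```python
BYTES_PER_TIB = 2 ** 40

def estimate_query_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Estimate a query's cost from the bytes it scanned and a per-TiB rate."""
    if bytes_processed < 0 or price_per_tib < 0:
        raise ValueError("bytes_processed and price_per_tib must be non-negative")
    return bytes_processed / BYTES_PER_TIB * price_per_tib

# Placeholder numbers: a query scanning 512 GiB at a hypothetical rate of 4.0 per TiB.
example_cost = estimate_query_cost(bytes_processed=512 * 2 ** 30, price_per_tib=4.0)
```

Tracking this figure per pipeline makes it easy to spot the one query that scans far more data than it needs.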

Even as we increase our machine sizes, data warehouses, data pipelines, etc., we completely understand the costs and benefits that come with it, and we have full control over turning things on and off if needed.

This has saved us, and will keep saving us, from a lot of headaches.

 

Conclusion

 
 



Photo by Fernando Brasil on Unsplash

 

If you’ve read until this point, you must really have a thing for data.
You should!

We’ve already made ETLs and ELTs. Who knows what kind of pipelines we will be building in the future?

In this article, we talked about —

  • What are ELT/ETL Data Pipelines?
  • How I redesigned ETL to ELT Pipelines
  • Why I did it

As usual, I end with a quote.


Data is the new science. Big Data holds the answers.
- Pat Gelsinger


 

Subscribe to my newsletter to stay in touch.

 
You can also support me by signing up for a Medium membership through my link. You will be able to read an unlimited amount of stories from me and other incredible writers!

I am working on more stories, writings, and guides in the data industry. You can absolutely expect more posts like this. In the meantime, feel free to check out my other articles to temporarily fill your hunger for data.

Thanks for reading! If you want to get in touch with me, feel free to reach me at nickmydata@gmail.com or my LinkedIn Profile. You can also view the code for previous write-ups in my Github.

 
Bio: Nicholas Leong is a data engineer, currently working in an online classifieds tech company. In his years of experience, Nicholas has fully designed batch and streaming pipelines, improved data warehousing solutions, and performed machine learning projects for the organization. During his free time, Nicholas likes to work on his own projects to improve his skills. He also writes about his work, projects, and experiences to share them with the world. Visit his site to check out his work!

Original. Reposted with permission.
