
Using schema and lineage to understand the root cause of your data anomalies.



By Barr Moses, CEO and Co-founder of Monte Carlo & Ryan Kearns, Machine Learning Engineer at Monte Carlo

In this article series, we walk through how you can create your own data observability monitors from scratch, mapping to five key pillars of data health. Part I can be found here.

Part II of this series was adapted from Barr Moses and Ryan Kearns’ O’Reilly training, Managing Data Downtime: Applying Observability to Your Data Pipelines, the industry’s first-ever course on data observability. The associated exercises are available here, and the adapted code shown in this article is available here.

As the world’s appetite for data increases, robust data pipelines are all the more imperative. When data breaks — whether from schema changes, null values, duplication, or otherwise — data engineers need to know.

Most importantly, we need to assess the root cause of the breakage — and fast — before it affects downstream systems and consumers. We use “data downtime” to refer to periods of time when data is missing, erroneous, or otherwise inaccurate. If you’re a data professional, you may be familiar with asking the following questions:

  • Is the data up to date?
  • Is the data complete?
  • Are fields within expected ranges?
  • Is the null rate higher or lower than it should be?
  • Has the schema changed?

To answer these questions in an effective way, we can take a page from the software engineer’s playbook: monitoring and observability.

To refresh your memory since Part I, we define data observability as an organization’s ability to answer these questions and assess the health of their data ecosystem. Reflecting key variables of data health, the five pillars of data observability are:

  • Freshness: is my data up to date? Are there gaps in time where my data has not been updated?
  • Distribution: how healthy is my data at the field-level? Is my data within expected ranges?
  • Volume: is my data intake meeting expected thresholds?
  • Schema: has the formal structure of my data management system changed?
  • Lineage: if some of my data is down, what is affected upstream and downstream? How do my data sources depend on one another?

In this article series, we’re interested in pulling back the curtain and investigating what data observability looks like in the code.

In Part I, we looked at the first two pillars, freshness and distribution, and showed how a little SQL code can operationalize these concepts. These are what we would call more “classic” anomaly detection problems: given a steady stream of data, does anything look out of whack? Good anomaly detection is certainly part of the data observability puzzle, but it’s not everything.

Equally important is context. If an anomaly occurred, great. But where? What upstream pipelines may be the cause? What downstream dashboards will be affected? And has the formal structure of my data changed? Good data observability hinges on our ability to properly leverage metadata to answer these questions (and many others) so we can identify the root cause and fix the issue before it becomes a bigger problem.

In this article, we’ll look at the two data observability pillars designed to give us this critical context: schema and lineage. Once again, we’ll use lightweight tools like Jupyter and SQLite, so you can easily spin up our environment and try these exercises out yourself. Let’s get started.

 

Our Data Environment

 
This tutorial is based on Exercises 2 and 3 of our O’Reilly course, Managing Data Downtime. You’re welcome to try out these exercises on your own using a Jupyter Notebook and SQL. We’ll be going into more detail, including Exercise 4, in future articles.

If you read Part I of this series, you should be familiar with our data. As before, we’ll work with mock astronomical data about habitable exoplanets. We generated the dataset with Python, modeling data and anomalies off of real incidents I’ve come across in production environments. This dataset is entirely free to use, and the utils folder in the repository contains the code that generated the data, if you’re interested.

I’m using SQLite 3.32.3, which should make the database accessible from either the command prompt or SQL files with minimal setup. The concepts extend to really any query language, and these implementations can be extended to MySQL, Snowflake, and other database environments with minimal changes.

Once again, we have our EXOPLANETS table:

$ sqlite3 EXOPLANETS.db
sqlite> PRAGMA TABLE_INFO(EXOPLANETS);
0 | _id            | TEXT | 0 | | 0
1 | distance       | REAL | 0 | | 0
2 | g              | REAL | 0 | | 0
3 | orbital_period | REAL | 0 | | 0
4 | avg_temp       | REAL | 0 | | 0
5 | date_added     | TEXT | 0 | | 0


A database entry in EXOPLANETS contains the following information:

0. _id: A UUID corresponding to the planet.
1. distance: Distance from Earth, in light-years.
2. g: Surface gravity as a multiple of g, the standard acceleration due to gravity at Earth’s surface.
3. orbital_period: Length of a single orbital cycle in days.
4. avg_temp: Average surface temperature in Kelvin.
5. date_added: The date our system discovered the planet and added it automatically to our databases.

Note that one or more of distance, g, orbital_period, and avg_temp may be NULL for a given planet as a result of missing or erroneous data.

sqlite> SELECT * FROM EXOPLANETS LIMIT 5;

Note that this exercise is retroactive — we’re looking at historical data. In a production data environment, data observability is real time and applied at each stage of the data life cycle, and thus will involve a slightly different implementation than what is done here.

It looks like our oldest data is dated 2020-01-01 (note: most databases will not store timestamps for individual records, so our DATE_ADDED column is keeping track for us). Our newest data…

sqlite> SELECT DATE_ADDED FROM EXOPLANETS ORDER BY DATE_ADDED DESC LIMIT 1;
2020-07-18


… looks to be from 2020-07-18. Of course, this is the same table we used in the past article. If we want to explore the more context-laden pillars of schema and lineage, we’ll need to expand our environment.

Now, in addition to EXOPLANETS, we have a table called EXOPLANETS_EXTENDED, which is a superset of our past table. It’s useful to think of these as the same table at different moments in time. In fact, EXOPLANETS_EXTENDED has data dating back to 2020-01-01…

sqlite> SELECT DATE_ADDED FROM EXOPLANETS_EXTENDED ORDER BY DATE_ADDED ASC LIMIT 1;
2020-01-01


… but also contains data up to 2020-09-06, further than EXOPLANETS:

sqlite> SELECT DATE_ADDED FROM EXOPLANETS_EXTENDED ORDER BY DATE_ADDED DESC LIMIT 1;
2020-09-06


 

Visualizing schema changes

 
Something else is different between these tables:

sqlite> PRAGMA TABLE_INFO(EXOPLANETS_EXTENDED);
0 | _ID            | VARCHAR(16777216) | 1 | | 0
1 | DISTANCE       | FLOAT             | 0 | | 0
2 | G              | FLOAT             | 0 | | 0
3 | ORBITAL_PERIOD | FLOAT             | 0 | | 0
4 | AVG_TEMP       | FLOAT             | 0 | | 0
5 | DATE_ADDED     | TIMESTAMP_NTZ(6)  | 1 | | 0
6 | ECCENTRICITY   | FLOAT             | 0 | | 0
7 | ATMOSPHERE     | VARCHAR(16777216) | 0 | | 0


In addition to the 6 fields in EXOPLANETS, the EXOPLANETS_EXTENDED table contains two additional fields:

6. eccentricity: the orbital eccentricity of the planet about its host star.
7. atmosphere: the dominant chemical makeup of the planet’s atmosphere.

Note that like distance, g, orbital_period, and avg_temp, both eccentricity and atmosphere may be NULL for a given planet as a result of missing or erroneous data. For example, rogue planets have undefined orbital eccentricity, and many planets don’t have atmospheres at all.

Note also that data is not backfilled, meaning data entries from the beginning of the table (data contained also in the EXOPLANETS table) will not have eccentricity and atmosphere information.

sqlite> SELECT
   ...>     DATE_ADDED,
   ...>     ECCENTRICITY,
   ...>     ATMOSPHERE
   ...> FROM
   ...>     EXOPLANETS_EXTENDED
   ...> ORDER BY
   ...>     DATE_ADDED ASC
   ...> LIMIT 10;
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |


The addition of two fields is an example of a schema change: our data’s formal blueprint has been modified. Schema changes occur when an alteration is made to the structure of your data, and can be frustrating to manually debug. Schema changes can indicate any number of things about your data, including:

  • The addition of new API endpoints
  • Supposedly deprecated fields that are not yet… deprecated
  • The addition or subtraction of columns, rows, or entire tables

In an ideal world, we’d like a record of this change, as it represents a vector for possible issues with our pipeline. Unfortunately, our database is not naturally configured to keep track of such changes. It has no versioning history.

We ran into this issue in Part I when querying for the age of individual records, and added the DATE_ADDED column to cope. In this case, we’ll do something similar, except with the addition of an entire table:

sqlite> PRAGMA TABLE_INFO(EXOPLANETS_COLUMNS);
0 | DATE    | TEXT | 0 | | 0
1 | COLUMNS | TEXT | 0 | | 0


The EXOPLANETS_COLUMNS table “versions” our schema by recording the columns in EXOPLANETS_EXTENDED at any given date. Looking at the very first and last entries, we see that the columns definitely changed at some point:

sqlite> SELECT * FROM EXOPLANETS_COLUMNS ORDER BY DATE ASC LIMIT 1;
2020-01-01 | [
              (0, '_id', 'TEXT', 0, None, 0),
              (1, 'distance', 'REAL', 0, None, 0),
              (2, 'g', 'REAL', 0, None, 0),
              (3, 'orbital_period', 'REAL', 0, None, 0),
              (4, 'avg_temp', 'REAL', 0, None, 0),
              (5, 'date_added', 'TEXT', 0, None, 0)
             ]

sqlite> SELECT * FROM EXOPLANETS_COLUMNS ORDER BY DATE DESC LIMIT 1;
2020-09-06 | [
              (0, '_id', 'TEXT', 0, None, 0),
              (1, 'distance', 'REAL', 0, None, 0),
              (2, 'g', 'REAL', 0, None, 0),
              (3, 'orbital_period', 'REAL', 0, None, 0),
              (4, 'avg_temp', 'REAL', 0, None, 0),
              (5, 'date_added', 'TEXT', 0, None, 0),
              (6, 'eccentricity', 'REAL', 0, None, 0),
              (7, 'atmosphere', 'TEXT', 0, None, 0)
             ]
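
The article does not show how EXOPLANETS_COLUMNS gets populated. One plausible approach, sketched here under assumptions, is to snapshot the output of PRAGMA table_info once per day. The helper name snapshot_schema and the toy DEMO table are hypothetical and only illustrate the idea; the real repository may do this differently.

```python
import sqlite3

def snapshot_schema(conn, table, snapshot_date):
    """Record the current column list of `table` under `snapshot_date`.

    Hypothetical helper: stores the PRAGMA table_info rows as text,
    mirroring the (cid, name, type, notnull, dflt_value, pk) tuples above.
    """
    # Table names cannot be bound as parameters, so `table` must be trusted input.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    conn.execute(
        "INSERT INTO EXOPLANETS_COLUMNS (DATE, COLUMNS) VALUES (?, ?)",
        (snapshot_date, str(columns)),
    )
    return columns

# Toy in-memory database, not the article's EXOPLANETS.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS_COLUMNS (DATE TEXT, COLUMNS TEXT)")
conn.execute("CREATE TABLE DEMO (_id TEXT, distance REAL)")

before = snapshot_schema(conn, "DEMO", "2020-01-01")
conn.execute("ALTER TABLE DEMO ADD COLUMN eccentricity REAL")  # the schema change
after = snapshot_schema(conn, "DEMO", "2020-01-02")

for row in conn.execute("SELECT DATE, COLUMNS FROM EXOPLANETS_COLUMNS ORDER BY DATE"):
    print(row)
```

Storing the column list as text keeps the versioning table trivial to compare across dates, at the cost of having to parse it if you later want per-column diffs.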


Now, returning to our original question: when, exactly, did the schema change? Since our column lists are indexed by dates, we can find the date of the change with a quick SQL script that compares each date’s column list against the previous one.

Here’s the data returned, which I’ve reformatted for legibility:

DATE:         2020-07-19
NEW_COLUMNS:  [
               (0, '_id', 'TEXT', 0, None, 0),
               (1, 'distance', 'REAL', 0, None, 0),
               (2, 'g', 'REAL', 0, None, 0),
               (3, 'orbital_period', 'REAL', 0, None, 0),
               (4, 'avg_temp', 'REAL', 0, None, 0),
               (5, 'date_added', 'TEXT', 0, None, 0),
               (6, 'eccentricity', 'REAL', 0, None, 0),
               (7, 'atmosphere', 'TEXT', 0, None, 0)
              ]
PAST_COLUMNS: [
               (0, '_id', 'TEXT', 0, None, 0),
               (1, 'distance', 'REAL', 0, None, 0),
               (2, 'g', 'REAL', 0, None, 0),
               (3, 'orbital_period', 'REAL', 0, None, 0),
               (4, 'avg_temp', 'REAL', 0, None, 0),
               (5, 'date_added', 'TEXT', 0, None, 0)
              ]


With this query, we return the offending date: 2020-07-19. Like freshness and distribution observability, achieving schema observability follows a pattern: we identify the useful metadata that signals pipeline health, track it, and build detectors to alert us of potential issues. Supplying an additional table like EXOPLANETS_COLUMNS is one way to track schema, but there are many others. We encourage you to think about how you could implement a schema change detector for your own data pipeline!
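
As one possible starting point, here is a minimal sketch of a schema change detector over a table shaped like EXOPLANETS_COLUMNS. It is not necessarily the query used in the course: it assumes a LAG window function (available in SQLite 3.25 and later) and runs against a toy in-memory table with shortened column lists.

```python
import sqlite3

# Compare each day's column list with the previous day's and keep the days that differ.
DETECTOR_SQL = """
WITH CHANGES AS (
    SELECT
        DATE,
        COLUMNS AS NEW_COLUMNS,
        LAG(COLUMNS) OVER (ORDER BY DATE) AS PAST_COLUMNS
    FROM EXOPLANETS_COLUMNS
)
SELECT DATE, NEW_COLUMNS, PAST_COLUMNS
FROM CHANGES
WHERE PAST_COLUMNS IS NOT NULL      -- skip the first day, which has no predecessor
  AND NEW_COLUMNS != PAST_COLUMNS
ORDER BY DATE ASC;
"""

# Toy data: abbreviated column lists, made up for illustration.
old = "[(0, '_id', 'TEXT', 0, None, 0)]"
new = "[(0, '_id', 'TEXT', 0, None, 0), (1, 'eccentricity', 'REAL', 0, None, 0)]"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS_COLUMNS (DATE TEXT, COLUMNS TEXT)")
conn.executemany(
    "INSERT INTO EXOPLANETS_COLUMNS VALUES (?, ?)",
    [("2020-07-17", old), ("2020-07-18", old), ("2020-07-19", new), ("2020-07-20", new)],
)

changes = conn.execute(DETECTOR_SQL).fetchall()
for date, new_columns, past_columns in changes:
    print(date, "schema changed")
```

A plain text comparison flags any difference in the stored list, including reordering; a stricter detector might parse the lists and report added and removed columns separately.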

 

Visualizing lineage

 
We’ve described lineage as the most holistic of the 5 pillars of data observability, and for good reason.

Lineage contextualizes incidents by telling us (1) which downstream sources may be impacted, and (2) which upstream sources may be the root cause. While it’s not intuitive to “visualize” lineage with SQL code, a quick example may illustrate how it can be useful.

 

For this, we’ll need to expand our data environment once again.

 

Introducing: HABITABLES

 
Let’s add another table to our database. So far, we’ve been recording data on exoplanets. Here’s one fun question to ask: how many of these planets may harbor life?

The HABITABLES table takes data from EXOPLANETS to help us answer that question:

sqlite> PRAGMA TABLE_INFO(HABITABLES);
0 | _id          | TEXT | 0 | | 0
1 | perihelion   | REAL | 0 | | 0
2 | aphelion     | REAL | 0 | | 0
3 | atmosphere   | TEXT | 0 | | 0
4 | habitability | REAL | 0 | | 0
5 | min_temp     | REAL | 0 | | 0
6 | max_temp     | REAL | 0 | | 0
7 | date_added   | TEXT | 0 | | 0


An entry in HABITABLES contains the following:

0. _id: A UUID corresponding to the planet.
1. perihelion: The closest distance to the celestial body during an orbital period.
2. aphelion: The furthest distance to the celestial body during an orbital period.
3. atmosphere: The dominant chemical makeup of the planet’s atmosphere.
4. habitability: A real number between 0 and 1, indicating how likely the planet is to harbor life.
5. min_temp: The minimum temperature on the planet’s surface.
6. max_temp: The maximum temperature on the planet’s surface.
7. date_added: The date our system discovered the planet and added it automatically to our databases.

Like the columns in EXOPLANETS, values for perihelion, aphelion, atmosphere, min_temp, and max_temp are allowed to be NULL. In fact, perihelion and aphelion will be NULL for any _id in EXOPLANETS where eccentricity is NULL, since you use orbital eccentricity to calculate these metrics. This explains why these two fields are always NULL in our older data entries:

sqlite> SELECT * FROM HABITABLES LIMIT 5;


So, we know that HABITABLES depends on the values in EXOPLANETS (or, equally, EXOPLANETS_EXTENDED), and EXOPLANETS_COLUMNS does as well. A dependency graph of our database looks like this:

[Figure: dependency graph of the database. Image courtesy of Monte Carlo.]
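
The dependencies just described can also be written down as a small graph in code. The sketch below is an illustration, not part of the original exercises: the DOWNSTREAM dictionary and both helper functions are hypothetical names, and the edges encode only the relationships stated above.

```python
from collections import deque

# Edges point from an upstream table to the tables that read from it.
DOWNSTREAM = {
    "EXOPLANETS": ["HABITABLES", "EXOPLANETS_COLUMNS"],
    "HABITABLES": [],
    "EXOPLANETS_COLUMNS": [],
}

def downstream_of(table, graph=DOWNSTREAM):
    """Every table reachable from `table` by following downstream edges."""
    seen, queue = set(), deque([table])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def upstream_of(table, graph=DOWNSTREAM):
    """Every table from which `table` can be reached."""
    return {t for t in graph if table in downstream_of(t, graph)}

print(sorted(downstream_of("EXOPLANETS")))  # tables an EXOPLANETS incident may affect
print(sorted(upstream_of("HABITABLES")))    # candidate root causes for a HABITABLES incident
```

With only three tables the traversal is overkill, but the same two questions (what is affected downstream, what could be the cause upstream) apply unchanged to larger graphs.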

 

Very simple lineage information, but already useful. Let’s look at an anomaly in HABITABLES in the context of this graph, and see what we can learn.

 

Investigating an anomaly

 
When we have a key metric, like habitability in HABITABLES, we can assess the health of that metric in several ways. For a start, what is the average value of habitability for new data on a given day?

Looking at this data, we see that something is wrong. The average value for habitability is normally around 0.5, but it halves to around 0.25 later in the recorded data.

[Figure: A distribution anomaly… but what caused it?]

 

This is a clear distributional anomaly, but what exactly is going on? In other words, what is the root cause of this anomaly?

Why don’t we look at the NULL rate for habitability, like we did in Part I?

Fortunately, nothing looks out of character here, so the NULL rate doesn’t look promising as the cause of our issue. What if we looked at another distributional health metric, the rate of zero values?

Something seems evidently more amiss here. Historically, habitability was virtually never zero, but at later dates it spikes up to nearly 40% on average. This has the effect of lowering the field’s average value, which is the drop we detected.
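
The three per-day health metrics discussed above (average value, NULL rate, and zero rate) can all be computed with a single GROUP BY on the date column. The following is a minimal sketch on a toy in-memory HABITABLES table with made-up values; the actual queries from the exercises may be structured differently.

```python
import sqlite3

# One row per day: average habitability, share of NULLs, share of exact zeros.
METRICS_SQL = """
SELECT
    DATE_ADDED,
    AVG(HABITABILITY) AS AVG_HABITABILITY,  -- AVG ignores NULL values
    AVG(CASE WHEN HABITABILITY IS NULL THEN 1.0 ELSE 0.0 END) AS NULL_RATE,
    AVG(CASE WHEN HABITABILITY = 0 THEN 1.0 ELSE 0.0 END) AS ZERO_RATE
FROM HABITABLES
GROUP BY DATE_ADDED
ORDER BY DATE_ADDED;
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HABITABLES (_id TEXT, habitability REAL, date_added TEXT)")
conn.executemany(
    "INSERT INTO HABITABLES VALUES (?, ?, ?)",
    [
        # Toy values chosen for illustration only.
        ("a", 0.4, "2020-07-18"), ("b", 0.6, "2020-07-18"),
        ("c", None, "2020-07-18"), ("d", 0.5, "2020-07-18"),
        ("e", 0.0, "2020-07-19"), ("f", 0.0, "2020-07-19"),
        ("g", 0.5, "2020-07-19"), ("h", 0.5, "2020-07-19"),
    ],
)

metrics = conn.execute(METRICS_SQL).fetchall()
for date_added, avg_hab, null_rate, zero_rate in metrics:
    print(date_added, avg_hab, null_rate, zero_rate)
```

In the toy data the second day's zero rate jumps while its NULL rate does not, which is the same pattern the article describes: the average drops because of zeros, not because of missing values.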

[Figure: A distribution anomaly… but what caused it?]

 

We can adapt one of the distribution detectors we built in Part I to get the first date of appreciable zero rates in the habitability field. I ran this query through the command line:

$ sqlite3 EXOPLANETS.db < queries/lineage/habitability-zero-rate-detector.sql
DATE_ADDED | HABITABILITY_ZERO_RATE | PREV_HABITABILITY_ZERO_RATE
2020-07-19 | 0.369047619047619      | 0.0


2020-07-19 was the first date the zero rate began showing anomalous results. Recall that this is the same day as the schema change detection in EXOPLANETS_EXTENDED. EXOPLANETS_EXTENDED is upstream from HABITABLES, so it’s very possible that these two incidents are related.
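
The contents of the detector script are not reproduced in this article. A minimal sketch of what such a zero-rate detector might look like follows. It assumes a LAG window function (SQLite 3.25 or later) and a hypothetical threshold of 0.1 for an "appreciable" zero rate, and it runs on toy data rather than the real EXOPLANETS.db.

```python
import sqlite3

# Flag days whose zero rate crosses the threshold when the previous day's did not.
DETECTOR_SQL = """
WITH ZERO_RATES AS (
    SELECT
        DATE_ADDED,
        AVG(CASE WHEN HABITABILITY = 0 THEN 1.0 ELSE 0.0 END) AS HABITABILITY_ZERO_RATE
    FROM HABITABLES
    GROUP BY DATE_ADDED
),
WITH_PREV AS (
    SELECT
        DATE_ADDED,
        HABITABILITY_ZERO_RATE,
        LAG(HABITABILITY_ZERO_RATE) OVER (ORDER BY DATE_ADDED) AS PREV_HABITABILITY_ZERO_RATE
    FROM ZERO_RATES
)
SELECT DATE_ADDED, HABITABILITY_ZERO_RATE, PREV_HABITABILITY_ZERO_RATE
FROM WITH_PREV
WHERE HABITABILITY_ZERO_RATE > :threshold
  AND PREV_HABITABILITY_ZERO_RATE <= :threshold
ORDER BY DATE_ADDED;
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HABITABLES (_id TEXT, habitability REAL, date_added TEXT)")
conn.executemany(
    "INSERT INTO HABITABLES VALUES (?, ?, ?)",
    [
        # Toy values chosen for illustration only.
        ("a", 0.4, "2020-07-17"), ("b", 0.6, "2020-07-17"),
        ("c", 0.5, "2020-07-18"), ("d", 0.7, "2020-07-18"),
        ("e", 0.0, "2020-07-19"), ("f", 0.0, "2020-07-19"),
        ("g", 0.5, "2020-07-19"), ("h", 0.6, "2020-07-19"),
        ("i", 0.0, "2020-07-20"), ("j", 0.5, "2020-07-20"),
    ],
)

# The 0.1 threshold is an assumption; tune it to your own data.
alerts = conn.execute(DETECTOR_SQL, {"threshold": 0.1}).fetchall()
for alert in alerts:
    print(alert)
```

Requiring the previous day to be at or below the threshold means the detector fires once, on the first anomalous day, rather than on every day the zero rate stays high.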

It is in this way that lineage information can help us identify the root cause of incidents, and move quicker towards resolving them. Compare the two following explanations for this incident in HABITABLES:

  1. On 2020-07-19, the zero rate of the habitability column in the HABITABLES table jumped from 0% to 37%.
  2. On 2020-07-19, we began tracking two additional fields, eccentricity and atmosphere, in the EXOPLANETS table. This had an adverse effect on the downstream table HABITABLES, often setting the fields min_temp and max_temp to extreme values whenever eccentricity was not NULL. In turn, this caused the zero rate of the habitability field to spike, which we detected as an anomalous decrease in the average value.

Explanation (1) uses just the fact that an anomaly took place. Explanation (2) uses lineage, in terms of dependencies between both tables and fields, to put the incident in context and determine the root cause. Everything in (2) is actually correct, by the way, and I encourage you to mess around with the environment to understand for yourself what’s going on. While these are just simple examples, an engineer equipped with (2) would be faster to understand and resolve the underlying issue, and this is all owed to proper observability.

 

What’s next?

 
Tracking schema changes and lineage can give you unprecedented visibility into the health and usage patterns of your data, providing vital contextual information about who, what, where, why, and how your data was used. In fact, schema and lineage are the two most important data observability pillars when it comes to understanding the downstream (and often real-world) implications of data downtime.

To summarize:

  • Observing our data’s schema means understanding the formal structure of our data, and when and how it changes.
  • Observing our data’s lineage means understanding the upstream and downstream dependencies in our pipeline, and putting isolated incidents in a larger context.
  • Both of these pillars of data observability involve tracking the proper metadata, and transforming our data in a way that makes anomalies understandable.
  • Better observability means better understanding of why and how data breaks, reducing both time-to-detection and time-to-resolution.

We hope that this second installment of “Data Observability in Context” was useful.

Until Part III, here’s wishing you no data downtime!

Interested in learning more about Monte Carlo’s approach to data observability? Reach out to Ryan, Barr, and the Monte Carlo team.

 
Barr Moses is the CEO and Co-founder of Monte Carlo, a data observability company. Previously, she served as a VP of Operations at Gainsight.

Ryan Kearns is a data and machine learning engineer at Monte Carlo and a rising senior at Stanford University.

Original. Reposted with permission.
