To trigger an alert when data breaks, data teams can leverage a tried and true tactic from our friends in software engineering: monitoring and observability. In this article, we walk through how you can create your own data quality monitors for freshness and distribution from scratch using SQL.



By Ryan Kearns, Stanford University & Barr Moses, CEO and Co-founder of Monte Carlo

Figure

Image courtesy of faaiq ackmerd on Pexels.

 

In this article series, we walk through how you can create your own data observability monitors from scratch, mapping to five key pillars of data health. Part 1 of this series was adapted from Barr Moses and Ryan Kearns’ O’Reilly training, Managing Data Downtime: Applying Observability to Your Data Pipelines, the industry’s first-ever course on data observability. The associated exercises are available here, and the adapted code shown in this article is available here.

From null values and duplicate rows, to modeling errors and schema changes, data can break for many reasons. Data testing is often our first line of defense against bad data, but what happens if data breaks during its life cycle?

We call this phenomenon data downtime, and it refers to periods of time when data is missing, erroneous, or otherwise inaccurate. Data downtime prompts us to ask questions such as:

  • Is the data up to date?
  • Is the data complete?
  • Are fields within expected ranges?
  • Is the null rate higher or lower than it should be?
  • Has the schema changed?

To trigger an alert when data breaks and prevent data downtime, data teams can leverage a tried and true tactic from our friends in software engineering: monitoring and observability.

We define data observability as an organization’s ability to answer these questions and assess the health of their data ecosystem. Reflecting key variables of data health, the five pillars of data observability are:

  • Freshness: is my data up to date? Are there gaps in time where my data has not been updated?
  • Distribution: how healthy is my data at the field-level? Is my data within expected ranges?
  • Volume: is my data intake meeting expected thresholds?
  • Schema: has the formal structure of my data management system changed?
  • Lineage: if some of my data is down, what is affected upstream and downstream? How do my data sources depend on one another?

It’s one thing to talk about data observability in this conceptual way, but a complete treatment should pull back the curtain: what does data observability actually look like, under the hood, in the code?

 

It’s difficult to answer this question entirely, since the details will depend on one’s choice of data warehouse, data lake, BI tools, preferred languages and frameworks, and so on. Even so, addressing these problems using lightweight tools like SQLite and Jupyter could be useful.

In this article, we walk through an example data ecosystem to create our own data quality monitors in SQL and explore what data observability looks like in practice.

Let’s take a look.

 

Data Observability in practice

 
This tutorial is based on Exercise 1 of our O’Reilly course, Managing Data Downtime. You’re welcome to try out these exercises on your own using a Jupyter Notebook and SQL. We’ll be going into more detail, including exercises 2, 3 and 4, in future articles.

Our sample data ecosystem uses mock astronomical data about habitable exoplanets. For the purpose of this exercise, I generated the dataset with Python, modeling anomalies off of real incidents I’ve come across in production environments. This dataset is entirely free to use, and the utils folder in the repository contains the code that generated the data, if you’re interested.

I’m using SQLite 3.32.3, which should make the database accessible from either the command prompt or SQL files with minimal setup. The concepts extend to really any query language, and these implementations can be extended to MySQL, Snowflake, and other database environments with minimal changes.

$ sqlite3 EXOPLANETS.db
sqlite> PRAGMA TABLE_INFO(EXOPLANETS);
0 | _id            | TEXT | 0 | | 0
1 | distance       | REAL | 0 | | 0
2 | g              | REAL | 0 | | 0
3 | orbital_period | REAL | 0 | | 0
4 | avg_temp       | REAL | 0 | | 0
5 | date_added     | TEXT | 0 | | 0


A database entry in EXOPLANETS contains the following info:
0. _id: A UUID corresponding to the planet.
1. distance: Distance from Earth, in lightyears.
2. g: Surface gravity as a multiple of g, the standard gravitational acceleration at Earth’s surface.
3. orbital_period: Length of a single orbital cycle in days.
4. avg_temp: Average surface temperature in kelvins.
5. date_added: The date our system discovered the planet and added it automatically to our databases.

Note that one or more of distance, g, orbital_period, and avg_temp may be NULL for a given planet as a result of missing or erroneous data.

sqlite> SELECT * FROM EXOPLANETS LIMIT 5;


Note that this exercise is retroactive — we’re looking at historical data. In a production data environment, data observability is real time and applied at each stage of the data life cycle, and thus will involve a slightly different implementation than what is done here.

For the purpose of this exercise, we’ll be building data observability algorithms for freshness and distribution, but in future articles, we’ll address the rest of our five pillars — and more.

 

Freshness

 
The first pillar of data observability we monitor for is freshness, which can give us a strong indicator of when critical data assets were last updated. If a report that is regularly updated on the hour suddenly looks very stale, this type of anomaly should give us a strong indication that something is off.

First, note the DATE_ADDED column. SQL tables don’t automatically store metadata on when individual records are added. So, to visualize freshness in this retroactive setting, we need to track that information ourselves.

Grouping by the DATE_ADDED column can give us insight into how EXOPLANETS updates daily. For example, we can query for the number of new IDs added per day:
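As a minimal sketch (not necessarily the exact query from the repository), the per-day count can be reproduced with Python’s built-in sqlite3 module against a tiny in-memory table. The sample rows and the ROWS_ADDED column alias are made up for illustration; only the table and column names come from the schema above.

```python
import sqlite3

# Hypothetical miniature stand-in for EXOPLANETS.db (only two columns kept).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS (_id TEXT, date_added TEXT)")
conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?)",
    [
        ("a", "2020-01-01"), ("b", "2020-01-01"),
        ("c", "2020-01-02"),
        ("d", "2020-01-05"), ("e", "2020-01-05"), ("f", "2020-01-05"),
    ],
)

# Count how many rows arrived in each daily batch.
ROWS_ADDED_SQL = """
SELECT
  DATE_ADDED,
  COUNT(*) AS ROWS_ADDED
FROM EXOPLANETS
GROUP BY DATE_ADDED
ORDER BY DATE_ADDED;
"""

rows_added = conn.execute(ROWS_ADDED_SQL).fetchall()
for date_added, n_rows in rows_added:
    print(date_added, n_rows)
```

Days with no new rows (here 2020-01-03 and 2020-01-04) simply do not appear in the output, which is how update gaps show up in this view.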

You can run this yourself with $ sqlite3 EXOPLANETS.db < queries/freshness/rows-added.sql in the repository. We get the following data back:

Based on this graphical representation of our dataset, it looks like EXOPLANETS consistently updates with around 100 new entries each day, though there are gaps where no data comes in for multiple days.

Recall that with freshness, we want to ask the question “is my data up to date?” — thus, knowing about those gaps in table updates is essential to understanding the reliability of our data.

Figure

Freshness anomalies!

 

This query operationalizes freshness by introducing a metric for DAYS_SINCE_LAST_UPDATE. (Note: since this tutorial uses SQLite3, the SQL syntax for calculating time differences will be different in MySQL, Snowflake, and other environments.)

The resulting table says “on date X, the most recent data in EXOPLANETS was Y days old.” This information is not explicitly available from the DATE_ADDED column in the table, but applying data observability gives us the tools to uncover it.
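One way such a metric could be computed in SQLite is with the LAG() window function and JULIANDAY(), sketched below through Python’s sqlite3 module. This is an illustrative reconstruction under assumptions, not necessarily the authors’ exact query: the sample rows are invented, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data: updates on Jan 1 and 2, then nothing until Jan 6.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS (_id TEXT, date_added TEXT)")
conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?)",
    [("a", "2020-01-01"), ("b", "2020-01-02"),
     ("c", "2020-01-02"), ("d", "2020-01-06")],
)

# For each update date, subtract the previous update date (in Julian days).
FRESHNESS_SQL = """
WITH UPDATES AS (
  SELECT DATE_ADDED, COUNT(*) AS ROWS_ADDED
  FROM EXOPLANETS
  GROUP BY DATE_ADDED
)
SELECT
  DATE_ADDED,
  JULIANDAY(DATE_ADDED)
    - JULIANDAY(LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED))
    AS DAYS_SINCE_LAST_UPDATE
FROM UPDATES
ORDER BY DATE_ADDED;
"""

freshness = conn.execute(FRESHNESS_SQL).fetchall()
for date_added, days in freshness:
    print(date_added, days)
```

The first date has no predecessor, so its DAYS_SINCE_LAST_UPDATE is NULL (None in Python).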


Now, we have the data we need to detect freshness anomalies. All that’s left to do is to set a threshold parameter for Y: how many days old is too many? A parameter turns a query into a detector, since it decides what counts as anomalous (read: worth alerting) and what doesn’t. (More on setting threshold parameters in a later article!)
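To make the “parameter turns a query into a detector” idea concrete, here is a hedged sketch that wraps the gap calculation in a WHERE clause with a bound threshold. The helper name freshness_incidents, the CTE names, and the sample dates are hypothetical; the filter DAYS_SINCE_LAST_UPDATE > threshold mirrors the one discussed in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS (_id TEXT, date_added TEXT)")
# Made-up dates: daily updates, then a 2-day gap and an 8-day gap.
dates = ["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-12"]
conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?)",
    [(str(i), d) for i, d in enumerate(dates)],
)

DETECTOR_SQL = """
WITH UPDATES AS (
  SELECT DATE_ADDED FROM EXOPLANETS GROUP BY DATE_ADDED
),
NUM_DAYS_UPDATES AS (
  SELECT
    DATE_ADDED,
    JULIANDAY(DATE_ADDED)
      - JULIANDAY(LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED))
      AS DAYS_SINCE_LAST_UPDATE
  FROM UPDATES
)
SELECT DATE_ADDED, DAYS_SINCE_LAST_UPDATE
FROM NUM_DAYS_UPDATES
WHERE DAYS_SINCE_LAST_UPDATE > :threshold
ORDER BY DATE_ADDED;
"""


def freshness_incidents(conn, threshold):
    """Return (date, gap_in_days) rows whose gap exceeds the threshold."""
    return conn.execute(DETECTOR_SQL, {"threshold": threshold}).fetchall()


incidents = freshness_incidents(conn, threshold=1)
print(incidents)
```

Binding the threshold as a query parameter, rather than hard-coding it, keeps the detector logic fixed while the alerting sensitivity is tuned separately.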

Freshness anomalies!

The data returned to us represents dates where freshness incidents occurred.

On 2020-05-14, the most recent data in the table was 8 days old! Such an outage may represent a breakage in our data pipeline, and would be good to know about if we’re using this data for anything worthwhile (and if we’re using this in a production environment, chances are, we are).


Note in particular the last line of the query: DAYS_SINCE_LAST_UPDATE > 1;.

Here, 1 is a model parameter. There’s nothing “correct” about this number, though changing it will impact what dates we consider to be incidents. The smaller the number, the more genuine anomalies we’ll catch (high recall), but chances are, several of these “anomalies” will not reflect real outages. The larger the number, the greater the likelihood all anomalies we catch will reflect true anomalies (high precision), but it’s possible we may miss some.

For the purpose of this example, we could change 1 to 7 and thus only catch the two worst outages on 2020-02-08 and 2020-05-14. Any choice here will reflect the particular use case and objectives, and is an important balance to strike that comes up again and again when applying data observability at scale to production environments.
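The effect of the threshold can be seen by sweeping it over the same gap data. The sketch below is illustrative only: the view name UPDATE_GAPS and the sample dates (with gaps of 1, 2, 3, 5, and 8 days) are assumptions, and the thresholds 1, 3, and 7 are the ones mentioned in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS (_id TEXT, date_added TEXT)")
# Made-up update dates whose successive gaps are 1, 2, 3, 5 and 8 days.
dates = ["2020-01-01", "2020-01-02", "2020-01-04",
         "2020-01-07", "2020-01-12", "2020-01-20"]
conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?)",
    [(str(i), d) for i, d in enumerate(dates)],
)

# Compute the gaps once; each threshold then only changes the final filter.
conn.execute("""
CREATE VIEW UPDATE_GAPS AS
SELECT
  DATE_ADDED,
  JULIANDAY(DATE_ADDED)
    - JULIANDAY(LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED))
    AS DAYS_SINCE_LAST_UPDATE
FROM (SELECT DISTINCT DATE_ADDED FROM EXOPLANETS)
""")

incidents_by_threshold = {}
for threshold in (1, 3, 7):
    rows = conn.execute(
        "SELECT DATE_ADDED FROM UPDATE_GAPS "
        "WHERE DAYS_SINCE_LAST_UPDATE > ? ORDER BY DATE_ADDED",
        (threshold,),
    ).fetchall()
    incidents_by_threshold[threshold] = [r[0] for r in rows]
    print(threshold, incidents_by_threshold[threshold])
```

A lower threshold flags more dates, and a higher one flags fewer. Which of those dates are real outages is a judgment the query cannot make on its own.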

Below, we leverage the same freshness detector, but with DAYS_SINCE_LAST_UPDATE > 3; serving as the threshold. Two of the smaller outages now go undetected.


Note the two undetected outages: these gaps must be 3 days or shorter.

Now we visualize the same freshness detector, but with DAYS_SINCE_LAST_UPDATE > 7; now serving as the threshold. All but the two largest outages now go undetected.


Just like planets, optimal model parameters sit in a “Goldilocks Zone” or “sweet spot” between values considered too low and too high. These data observability concepts (and more!) will be discussed in a later article.

 

Distribution

 
Next, we want to assess the field-level, distributional health of our data. Distribution tells us all of the expected values of our data, as well as how frequently each value occurs. One of the simplest questions is, “how often is my data NULL?” In many cases, some level of incomplete data is acceptable, but if a 10% null rate turns into 90%, we’ll want to know.
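A minimal sketch of a daily null-rate query, run through Python’s sqlite3 module, is shown here for a single field. The sample rows and the AVG_TEMP_NULL_RATE alias are assumptions for illustration; the same CASE/SUM/COUNT pattern could be repeated for DISTANCE, G, and ORBITAL_PERIOD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EXOPLANETS (_id TEXT, avg_temp REAL, date_added TEXT)"
)
# Made-up batches: 1 of 4 temperatures missing on day one, 2 of 2 on day two.
conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?, ?)",
    [
        ("a", 280.0, "2020-01-01"), ("b", None, "2020-01-01"),
        ("c", 300.5, "2020-01-01"), ("d", 250.0, "2020-01-01"),
        ("e", None, "2020-01-02"), ("f", None, "2020-01-02"),
    ],
)

# Fraction of rows per daily batch where AVG_TEMP is NULL.
NULL_RATE_SQL = """
SELECT
  DATE_ADDED,
  CAST(SUM(CASE WHEN AVG_TEMP IS NULL THEN 1 ELSE 0 END) AS FLOAT)
    / COUNT(*) AS AVG_TEMP_NULL_RATE
FROM EXOPLANETS
GROUP BY DATE_ADDED
ORDER BY DATE_ADDED;
"""

null_rates = conn.execute(NULL_RATE_SQL).fetchall()
for date_added, rate in null_rates:
    print(date_added, rate)
```

The CAST to FLOAT matters: without it, SQLite would perform integer division and every rate below 100% would collapse to 0.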

This query returns a lot of data! What’s going on?

The general formula CAST(SUM(CASE WHEN SOME_METRIC IS NULL THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*), when grouped by the DATE_ADDED column, is telling us the rate of NULL values for SOME_METRIC in the daily batches of new data in EXOPLANETS. It’s hard to get a sense by looking at the raw output, but a visual can help illuminate this anomaly:


The visuals make it clear that there are null rate “spike” events we should be detecting. Let’s focus on just the last metric, AVG_TEMP, for now. We can detect null spikes most basically with a simple threshold:
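A simple-threshold detector along these lines could look like the following sketch, which flags any day whose AVG_TEMP null rate exceeds 0.9. The sample batches and the NULL_RATES CTE name are invented for illustration, and the query is an approximation rather than the authors’ exact code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS (avg_temp REAL, date_added TEXT)")
# Made-up daily batches as (date, number of NULL temps, batch size).
daily = [("2020-06-01", 1, 4), ("2020-06-02", 4, 4),
         ("2020-06-03", 4, 4), ("2020-06-04", 0, 4)]
conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?)",
    [(None if i < n_null else 280.0, d)
     for d, n_null, n in daily for i in range(n)],
)

THRESHOLD_SQL = """
WITH NULL_RATES AS (
  SELECT
    DATE_ADDED,
    CAST(SUM(CASE WHEN AVG_TEMP IS NULL THEN 1 ELSE 0 END) AS FLOAT)
      / COUNT(*) AS AVG_TEMP_NULL_RATE
  FROM EXOPLANETS
  GROUP BY DATE_ADDED
)
SELECT DATE_ADDED, AVG_TEMP_NULL_RATE
FROM NULL_RATES
WHERE AVG_TEMP_NULL_RATE > 0.9
ORDER BY DATE_ADDED;
"""

spikes = conn.execute(THRESHOLD_SQL).fetchall()
print(spikes)
```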

Our first distribution anomalies.

As detection algorithms go, this approach is something of a blunt instrument. Sometimes, patterns in our data will be simple enough for a threshold like this to do the trick. In other cases, though, data will be noisy or have other complications, like seasonality, requiring us to change our approach.


For example, detecting 2020-06-02, 2020-06-03, and 2020-06-04 seems redundant. We can filter out dates that occur immediately after other alerts:

Note that in both of these queries, the key parameter is 0.9. We’re effectively saying: “any null rate higher than 90% is a problem, and I need to know about it.”
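One possible way to suppress those follow-on alerts, sketched under assumptions, is to compute the gap between consecutive alert dates and keep only alerts that do not fall exactly one day after another alert. The ALERTS CTE, the DAYS_SINCE_LAST_ALERT column, and the sample batches are hypothetical names and data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS (avg_temp REAL, date_added TEXT)")
# Made-up batches as (date, NULL count, batch size): a 3-day spike,
# a healthy day, and a separate one-day spike later on.
daily = [("2020-06-01", 1, 4), ("2020-06-02", 4, 4), ("2020-06-03", 4, 4),
         ("2020-06-04", 4, 4), ("2020-06-05", 0, 4), ("2020-06-10", 4, 4)]
conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?)",
    [(None if i < n_null else 280.0, d)
     for d, n_null, n in daily for i in range(n)],
)

DEDUP_SQL = """
WITH NULL_RATES AS (
  SELECT
    DATE_ADDED,
    CAST(SUM(CASE WHEN AVG_TEMP IS NULL THEN 1 ELSE 0 END) AS FLOAT)
      / COUNT(*) AS AVG_TEMP_NULL_RATE
  FROM EXOPLANETS
  GROUP BY DATE_ADDED
),
ALERTS AS (
  -- LAG runs over the already-filtered rows, so it looks at the previous alert.
  SELECT
    DATE_ADDED,
    AVG_TEMP_NULL_RATE,
    JULIANDAY(DATE_ADDED)
      - JULIANDAY(LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED))
      AS DAYS_SINCE_LAST_ALERT
  FROM NULL_RATES
  WHERE AVG_TEMP_NULL_RATE > 0.9
)
SELECT DATE_ADDED, AVG_TEMP_NULL_RATE
FROM ALERTS
WHERE DAYS_SINCE_LAST_ALERT IS NULL OR DAYS_SINCE_LAST_ALERT <> 1
ORDER BY DATE_ADDED;
"""

first_alerts = conn.execute(DEDUP_SQL).fetchall()
print(first_alerts)
```

Only the first day of each run of consecutive alerts survives, so a multi-day incident produces one notification instead of several.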


In this instance, we can (and should) be a bit more intelligent by applying the concept of rolling average with a more intelligent parameter:

One clarification: notice that on line 28, we filter using the quantity AVG_TEMP_NULL_RATE - TWO_WEEK_ROLLING_AVG. In other instances, we might want to take the ABS() of this error quantity, but not here. The reason is that a NULL rate “spike” is much more alarming if it represents an increase from the previous average. It may not be worthwhile to monitor whenever NULLs abruptly decrease in frequency, while the value in detecting a NULL rate increase is clear.
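A rolling-average detector might be sketched as follows. The window definition (the current row plus the 13 preceding daily batches) and the 0.3 cutoff are assumptions chosen for illustration, as are the sample batches; only the column names AVG_TEMP_NULL_RATE and TWO_WEEK_ROLLING_AVG come from this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EXOPLANETS (avg_temp REAL, date_added TEXT)")
# Made-up batches as (date, NULL count, batch size): four clean days,
# one day fully NULL, then a day that is half NULL.
daily = [("2020-06-01", 0, 2), ("2020-06-02", 0, 2), ("2020-06-03", 0, 2),
         ("2020-06-04", 0, 2), ("2020-06-05", 2, 2), ("2020-06-06", 1, 2)]
conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?)",
    [(None if i < n_null else 280.0, d)
     for d, n_null, n in daily for i in range(n)],
)

ROLLING_SQL = """
WITH NULL_RATES AS (
  SELECT
    DATE_ADDED,
    CAST(SUM(CASE WHEN AVG_TEMP IS NULL THEN 1 ELSE 0 END) AS FLOAT)
      / COUNT(*) AS AVG_TEMP_NULL_RATE
  FROM EXOPLANETS
  GROUP BY DATE_ADDED
),
NULL_WITH_AVG AS (
  -- The window counts rows (daily batches), not calendar days.
  SELECT
    *,
    AVG(AVG_TEMP_NULL_RATE) OVER (
      ORDER BY DATE_ADDED
      ROWS BETWEEN 13 PRECEDING AND CURRENT ROW
    ) AS TWO_WEEK_ROLLING_AVG
  FROM NULL_RATES
)
SELECT DATE_ADDED
FROM NULL_WITH_AVG
WHERE AVG_TEMP_NULL_RATE - TWO_WEEK_ROLLING_AVG > 0.3
ORDER BY DATE_ADDED;
"""

flagged = [row[0] for row in conn.execute(ROLLING_SQL)]
print(flagged)
```

Because the comparison is relative to recent history, a table that normally runs at a high null rate would not alert every day, while a sudden jump from a low baseline would.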

There are, of course, increasingly sophisticated metrics for anomaly detection like Z-scores and autoregressive modeling that are out of scope here. This tutorial just provides the basic scaffolding for field-health monitoring in SQL; I hope it can give you ideas for your own data!

 

What’s next?

 
This brief tutorial intends to show that “data observability” is not as mystical as the name suggests, and with a holistic approach to understanding your data health, you can ensure high data trust and reliability at every stage of your pipeline.

In fact, the core principles of data observability are achievable using plain SQL “detectors,” provided some key information, like record timestamps and historical table metadata, is kept. It’s also worth noting that key ML-powered parameter tuning is mandatory for end-to-end data observability systems that grow with your production environment.

Stay tuned for future articles in this series that focus on monitoring anomalies in distribution and schema, the role of lineage and metadata in data observability, and how to monitor these pillars together at scale to achieve more reliable data.

 

Until then — here’s wishing you no data downtime!

Interested in learning more about how to apply data observability at scale? Reach out to Ryan, Barr, and the rest of the Monte Carlo team.

 
Ryan Kearns is a rising senior at Stanford University double majoring in Computer Science and Philosophy. He's currently an ML engineering intern at Monte Carlo.

Barr Moses is the CEO and Co-founder of Monte Carlo, a data observability company. Prior, she served as a VP of Operations at Gainsight.

Original. Reposted with permission.
