[视频]国家食品药品监管总局：多措并举打击治理...

百度树立文化发展“新思想”新的伟大的实践必然产生新的伟大的理论，新的伟大的理论又必将指导新的伟大的实践，新时代中国特色社会主义文化创新需要新的指导思想。

DuckDB combines the simplicity and ease of use of SQLite with the analytical performance of specialized columnar databases. Learn more with Python examples.

By Abid Ali Awan, KDnuggets Assistant Editor on July 11, 2023 in SQL

Image by Author

What is DuckDB?

DuckDB is a free, open-source, embedded database management system designed for data analytics and online analytical processing. This means several things:

It's free and open-source software, so anyone can use and modify the code.
It's embedded, meaning the DBMS (database management system) runs in the same process as the application that uses it. This makes it fast and simple to use.
It's optimized for data analytics and OLAP (online analytical processing), not just transactional data like typical databases. This means the data is organized by columns instead of rows to optimize aggregation and analysis.
It supports standard SQL so you can run queries, aggregations, joins, and other SQL functions on the data.
It runs in-process, within the application itself rather than as a separate process. This eliminates the overhead of communicating between processes.
Like SQLite, it's a simple, file-based database so there's no separate server installation required. You just include the library in your application.

In summary, DuckDB provides an easy-to-use, embedded analytic database for applications that need fast and simple data analysis capabilities. It fills a niche for analytical processing where a full database server would be overkill.

Why is DuckDB Getting Popular?

There are many reasons companies are now building products on top of DuckDB. The database is designed for fast analytical queries which means it's optimized for aggregations, joins, and complex queries on large datasets - the types of queries often used in analytics and reporting. Moreover:?

It's simple to install, deploy, and use. There is no server to configure - DuckDB runs embedded within your application. This makes it easy to integrate into different programming languages and environments.
Despite its simplicity, DuckDB has a rich feature set. It supports the full SQL standard, transactions, secondary indexes, and integrates well with popular data analysis programming languages like Python and R.
DuckDB is free for anyone to use and modify, which lowers the bar for developers and data analysts to adopt it.
DuckDB is well-tested and stable. It has an extensive test suite and is continuously integrated and tested on a variety of platforms to ensure stability.
DuckDB offers comparable performance to specialized OLAP databases while being easier to deploy. This makes it suitable for both analytical queries on small to medium datasets as well as large enterprise datasets.

In short, DuckDB combines the simplicity and ease of use of SQLite with the analytical performance of specialized columnar databases. All of these factors - performance, simplicity, features, and open source license - contribute to DuckDB's growing popularity among developers and data analysts.

DuckDB Python Example

Let’s test out a few features of DuckDB using the Python API.?

You can instal DuckDB using Pypi:

pip install duckdb

For other programming language, head to the DuckDB’s installation guide.?

In this example, we will be using Data Science Salaries 2023 CSV dataset from Kaggle and try to test DuckDB’s various functionalities.?

Relation API

You can load a CSV file just like pandas into a relation. DuckDB provides a relational API that allows users to link query operations together. The queries are lazily evaluated, which enables DuckDB to optimize their execution.

We have loaded the data science salary dataset and displayed the alias.?

import duckdb
rel = duckdb.read_csv('ds_salaries.csv')
rel.alias

'ds_salaries.csv'

To display the column names we will use .columns similar to pandas.?

rel.columns

['work_year',
 'experience_level',
 'employment_type',
 'job_title',
 'salary',
 'salary_currency',
 'salary_in_usd',
 'employee_residence',
 'remote_ratio',
 'company_location',
 'company_size']

You can apply multiple functions to the relation to get specific results. In our case, we have filtered out “work_year”, displayed only three columns, and ordered and limited them to display the bottom five job titles based on the salaries.?

Learn more about Relational API by following the guide.?

rel.filter("work_year > 2021").project(
    "work_year,job_title,salary_in_usd"
).order("salary_in_usd").limit(5)

┌───────────┬─────────────────┬───────────────┐
│ work_year │    job_title    │ salary_in_usd │
│   int64   │     varchar     │     int64     │
├───────────┼─────────────────┼───────────────┤
│      2022 │ NLP Engineer    │          5132 │
│      2022 │ Data Analyst    │          5723 │
│      2022 │ BI Data Analyst │          6270 │
│      2022 │ AI Developer    │          6304 │
│      2022 │ Data Analyst    │          6359 │
└───────────┴─────────────────┴───────────────┘

You can also use Relational API to join two datasets. In our case, we are joining the same dataset by changing the alias name on a “job_title”.?

rel2 = duckdb.read_csv('ds_salaries.csv')
rel.set_alias('a').join(rel.set_alias('b'), 'job_title').limit(5)

┌───────────┬──────────────────┬─────────────────┬───┬──────────────┬──────────────────┬──────────────┐
│ work_year │ experience_level │ employment_type │ ... │ remote_ratio │ company_location │ company_size │
│   int64   │     varchar      │     varchar     │   │    int64     │     varchar      │   varchar    │
├───────────┼──────────────────┼─────────────────┼───┼──────────────┼──────────────────┼──────────────┤
│      2023 │ SE               │ FT              │ ... │          100 │ US               │ L            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ SE               │ FT              │ ... │          100 │ US               │ S            │
│      2023 │ SE               │ FT              │ ... │          100 │ US               │ S            │
├───────────┴──────────────────┴─────────────────┴───┴──────────────┴──────────────────┴──────────────┤
│ 5 rows                                                                         21 columns (6 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘

Direct SQL method

There are direct methods too. You just have to write SQL query to perform analysis on the dataset. Instead of the table name, you will write the location and name of the CSV file.?

duckdb.sql('SELECT * FROM "ds_salaries.csv" LIMIT 5')

┌───────────┬──────────────────┬─────────────────┬───┬──────────────┬──────────────────┬──────────────┐
│ work_year │ experience_level │ employment_type │ ... │ remote_ratio │ company_location │ company_size │
│   int64   │     varchar      │     varchar     │   │    int64     │     varchar      │   varchar    │
├───────────┼──────────────────┼─────────────────┼───┼──────────────┼──────────────────┼──────────────┤
│      2023 │ SE               │ FT              │ ... │          100 │ ES               │ L            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ SE               │ FT              │ ... │          100 │ CA               │ M            │
│      2023 │ SE               │ FT              │ ... │          100 │ CA               │ M            │
├───────────┴──────────────────┴─────────────────┴───┴──────────────┴──────────────────┴──────────────┤
│ 5 rows                                                                         11 columns (6 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘

Persistent Storage

By default, DuckDB operates on an in-memory database. This means that any tables created are stored in memory and not persisted to disk. However, by using the .connect() method, a connection can be made to a persistent database file on disk. Any data written to that database connection will then be saved to the disk file and reloaded when reconnecting to the same file.

We will create a database by using .connect() method.
Run an SQL query to create a table.
Use Query to add two records.
Display the newly created test table.?

import duckdb

con = duckdb.connect('kdn.db')

con.sql("CREATE TABLE test_table (i INTEGER, j STRING)")
con.sql("INSERT INTO test_table VALUES (1, 'one'),(9,'nine')")
con.table('test_table').show()

┌───────┬─────────┐
│   i   │    j    │
│ int32 │ varchar │
├───────┼─────────┤
│     1 │ one     │
│     9 │ nine    │
└───────┴─────────┘

We can also create the new table using a data science salary CSV file.?

con.sql('CREATE TABLE ds_salaries AS SELECT * FROM "ds_salaries.csv";')
con.table('ds_salaries').limit(5).show()

┌───────────┬──────────────────┬─────────────────┬───┬──────────────┬──────────────────┬──────────────┐
│ work_year │ experience_level │ employment_type │ ... │ remote_ratio │ company_location │ company_size │
│   int64   │     varchar      │     varchar     │   │    int64     │     varchar      │   varchar    │
├───────────┼──────────────────┼─────────────────┼───┼──────────────┼──────────────────┼──────────────┤
│      2023 │ SE               │ FT              │ ... │          100 │ ES               │ L            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ SE               │ FT              │ ... │          100 │ CA               │ M            │
│      2023 │ SE               │ FT              │ ... │          100 │ CA               │ M            │
├───────────┴──────────────────┴─────────────────┴───┴──────────────┴──────────────────┴──────────────┤
│ 5 rows                                                                         11 columns (6 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘

After performing all the tasks, you must close the connection to the database.?

con.close()

Conclusion

Why do I like DuckDB? It is fast and simple to learn and manage. I believe simplicity is the main reason DuckDB has become widely used in the data science community. DuckDB provides an intuitive SQL interface that is easy for data analysts and scientists to pick up. Installation is straightforward, and the database files are light and manageable. All of these make DuckDB a joy to use.

Check out my previous Deepnote article on Data Science with DuckDB for an in-depth analysis of features and use cases.

With robust tools for data loading, managing, and analysis, DuckDB offers an attractive option compared to other database solutions for data science. I believe DuckDB will continue gaining users in the coming years as more data professionals discover its user-friendly nature.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

meq是什么单位	9.3号是什么星座	八十岁叫什么之年	为什么不建议吃茵栀黄	为什么养鱼养单不养双
公检法是什么	上海最高楼叫什么大厦有多少米高	口腔溃疡什么症状	情感障碍是什么意思	什么运动能长高
肺部有阴影一般是什么病	红细胞是什么意思	肠胃不好吃什么菜比较好	更年期什么时候	rna是什么意思
什么是囊肿	转氨酶高说明什么	灰菜有什么功效与作用	ca19-9偏高是什么意思	婚检是什么意思

叫爸爸是什么意思hcv9jop0ns1r.cn	怀孕梦见蛇是什么意思hcv8jop5ns3r.cn	检察院是干什么的gangsutong.com	腊月初六是什么星座hcv9jop8ns1r.cn	第一次怀孕有什么反应bfb118.com
推杯换盏什么意思hcv9jop3ns5r.cn	子宫癌筛查做什么检查hcv9jop7ns2r.cn	消化不良的症状吃什么药hcv8jop8ns9r.cn	宝宝蛋白质过敏喝什么奶粉hcv8jop9ns4r.cn	骨折喝什么汤恢复得快hcv8jop5ns1r.cn
为什么来完月经下面痒hcv9jop1ns6r.cn	岁月如歌是什么意思hcv8jop0ns1r.cn	梦见芹菜是什么意思hcv7jop6ns1r.cn	控诉是什么意思hcv8jop8ns7r.cn	胃饱胀是什么原因hcv8jop5ns1r.cn
嗜酸性粒细胞偏低是什么原因hcv8jop7ns0r.cn	碳素笔是什么笔hcv7jop6ns8r.cn	重字五行属什么hcv8jop0ns7r.cn	部队班长是什么军衔hcv7jop9ns1r.cn	七月属什么生肖hcv7jop9ns3r.cn

[视频]国家食品药品监管总局：多措并举打击治理...

What is DuckDB?

Why is DuckDB Getting Popular?

DuckDB Python Example

Relation API

Direct SQL method

Persistent Storage

Conclusion

More On This Topic

Latest Posts

Top Posts

[视频]国家食品药品监管总局：多措并举 打击治理...

What is DuckDB?

Why is DuckDB Getting Popular?

DuckDB Python Example

Relation API

Direct SQL method

Persistent Storage

Conclusion

More On This Topic

Latest Posts

Top Posts

[视频]国家食品药品监管总局：多措并举打击治理...