{ "cells": [ { "cell_type": "markdown", "id": "cd4e580a-8b51-417c-9017-83723cf794d7", "metadata": {}, "source": [ "# duckdb,单机高效处理千万级以上数据" ] }, { "cell_type": "markdown", "id": "6e3382e6-8552-48e6-95ef-6beaae6b7be6", "metadata": {}, "source": [ "当你用pandas处理超过几百万甚至几千万上亿行的数据感觉太慢,又不太想搞pyspark,不妨试试duckdb.\n", "\n", "它在本地单机运行,可以像spark那样使用sql语句进行数据分析和数据转换,并且性能非常高。\n", "\n", "当处理几千万行以上的数据时,它的效率通常是pandas的几十几百倍。\n", "\n", "难能可贵的是,这个库的用法非常简单,核心API只有以下几个。\n", "\n", "\n", "```python\n", "import duckdb\n", "\n", "#输入建表:可以从parquet,pandas dataframe建表\n", "tb = duckdb.read_parquet('input_data.parquet') \n", "tb = duckdb.from_df(df) #从pandas转换\n", "\n", "#分析转换:执行sql,支持自定义函数UDF\n", "tb2 = duckdb.sql('select * from tb where val>1000 and val<10000 order by val')\n", "duckdb.create_function('myfunc',myfunc,[duckdb.typing.VARCHAR],duckdb.typing.VARCHAR) \n", "\n", "#输出落表: 推荐导出成parquet格式,效率最高\n", "tb2.to_parquet('output_data.parquet')\n", "df2 = tb2.to_df() #也可以转换成pandas\n", "```\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "5973d4f0-e198-4ab2-b825-87b467de7c74", "metadata": {}, "outputs": [], "source": [ "!pip install duckdb -i https://pypi.tuna.tsinghua.edu.cn/simple " ] }, { "cell_type": "code", "execution_count": 76, "id": "8d1fefd7-e702-4390-878f-f11eaabb670f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple\n", "Collecting Faker\n", " Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a7/0b/57ac98ca6aa8cd54246b440bc2aa8549e5dcbb99a72a399cab6d66c7f198/Faker-26.0.0-py3-none-any.whl (1.8 MB)\n", "\u001b[K |████████████████████████████████| 1.8 MB 3.2 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: python-dateutil>=2.4 in /usr/local/lib/python3.8/dist-packages (from Faker) (2.8.2)\n", "Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.4->Faker) (1.14.0)\n", "Installing collected packages: Faker\n", "Successfully installed Faker-26.0.0\n" ] } ], "source": [ "!pip install Faker -i https://pypi.tuna.tsinghua.edu.cn/simple " ] }, { "cell_type": "markdown", "id": "3c7a5965-1689-4202-a34e-5a6bf89b3945", "metadata": {}, "source": [ "## 一,输入建表" ] }, { "cell_type": "code", "execution_count": 1, "id": "2f22ebf0-fc42-45c6-a5c0-afb46e4201c7", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import duckdb " ] }, { "cell_type": "code", "execution_count": 2, "id": "f10cc2ea-d11c-408f-9d2a-206062ca3300", "metadata": {}, "outputs": [], "source": [ "#生成示例数据(5000万行)\n", "dfdemo = pd.DataFrame(\n", " {\n", " 'category': np.random.choice(list('ABCDEF'), 50000000),\n", " 'val': np.round(np.random.uniform(0, 1000000, 50000000), 3)\n", " }\n", ")\n", "dfdemo.to_parquet('dfdemo.parquet', index=False)\n" ] }, { "cell_type": "markdown", "id": "1e63d041-9963-4b4a-a20c-7d3a3ae30659", "metadata": {}, "source": [ "对比读取数据速度,duckdb比pandas快几十倍" ] }, { "cell_type": "code", "execution_count": 3, "id": "5ee8c204-8484-4775-a4ef-e3b1fee1583f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.48 s, sys: 497 ms, total: 2.98 s\n", "Wall time: 2.08 s\n" ] } ], "source": [ "%%time \n", "df = pd.read_parquet('dfdemo.parquet')" ] }, { "cell_type": "code", "execution_count": 4, "id": "9760a672-bd8a-4a61-ab5c-f1e99dc83fd7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.99 ms, sys: 31.5 ms, total: 33.5 ms\n", "Wall time: 77.1 ms\n" ] } ], "source": [ "%%time\n", "tb = duckdb.read_parquet('dfdemo.parquet')" ] }, { "cell_type": "code", "execution_count": 5, "id": "f40cf705-f550-4f96-a6e6-ba0be474efad", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['category', 'val']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tb.columns " ] }, { "cell_type": "markdown", "id": "e6aa225b-d18b-4158-a409-93b4ab1ccf64", "metadata": {}, "source": [ "## 二,SQL分析" ] }, { "cell_type": "markdown", "id": "580021f6-19a2-4ab8-9323-f2f303569be2", "metadata": {}, "source": [ "### 1, 基本查询" ] }, { "cell_type": "markdown", "id": "0e6f3dbf-fc71-4bf8-8cc3-209c8bb1a466", "metadata": {}, "source": [ "#### where查询对比 (duckdb比pandas快20倍)" ] }, { "cell_type": "code", "execution_count": 6, "id": "abfe1dec-357f-45f7-9587-957c51b1487b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 496 µs, sys: 1.22 ms, total: 1.72 ms\n", "Wall time: 20.4 ms\n" ] }, { "data": { "text/plain": [ "┌──────────┬──────────┐\n", "│ category │ val │\n", "│ varchar │ double │\n", "├──────────┼──────────┤\n", "│ F │ 1000.002 │\n", "│ E │ 1000.113 │\n", "│ B │ 1000.115 │\n", "│ B │ 1000.122 │\n", "│ D │ 1000.125 │\n", "└──────────┴──────────┘" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time \n", "# where 查询\n", "duckdb.sql('select * from tb where val>1000 and val<10000 order by val limit 5')" ] }, { "cell_type": "code", "execution_count": 7, "id": "9c43033d-003b-4863-93c7-bf0e28879ee4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 296 ms, sys: 105 ms, total: 401 ms\n", "Wall time: 399 ms\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
categoryval
35699151F1000.002
37196599E1000.113
38865354B1000.115
14015201B1000.122
24024316D1000.125
\n", "
" ], "text/plain": [ " category val\n", "35699151 F 1000.002\n", "37196599 E 1000.113\n", "38865354 B 1000.115\n", "14015201 B 1000.122\n", "24024316 D 1000.125" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time \n", "df.loc[(df['val']>1000)&(df['val']<10000),:].sort_values('val').head(5)" ] }, { "cell_type": "markdown", "id": "89824428-333a-4def-b15e-dec2d5501d8a", "metadata": {}, "source": [ "#### groupby 操作对比 (duckdb比pandas快40倍)" ] }, { "cell_type": "code", "execution_count": 8, "id": "1ecd79c3-3fc8-4bfe-a741-866ba9223db1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.85 ms, sys: 17 ms, total: 18.8 ms\n", "Wall time: 50.2 ms\n" ] }, { "data": { "text/plain": [ "┌──────────┬─────────┬────────────────────┐\n", "│ category │ rows │ avg_val │\n", "│ varchar │ int64 │ double │\n", "├──────────┼─────────┼────────────────────┤\n", "│ B │ 8328359 │ 500029.4997295229 │\n", "│ A │ 8334826 │ 499895.93226725736 │\n", "│ D │ 8329469 │ 500127.047667208 │\n", "│ E │ 8334574 │ 499913.2748367591 │\n", "│ C │ 8334322 │ 499964.2243027494 │\n", "│ F │ 8338450 │ 500146.17542576144 │\n", "└──────────┴─────────┴────────────────────┘" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time \n", "# groupby 操作\n", "duckdb.sql('select category, count(val) as rows, mean(val) as avg_val from tb group by category')" ] }, { "cell_type": "code", "execution_count": 10, "id": "0b0f66f4-d41c-41ac-a40e-b8d860d54743", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.13 s, sys: 230 ms, total: 2.36 s\n", "Wall time: 2.35 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
val
countmean
category
A8334826499895.932267
B8328359500029.499730
C8334322499964.224303
D8329469500127.047667
E8334574499913.274837
F8338450500146.175426
\n", "
" ], "text/plain": [ " val \n", " count mean\n", "category \n", "A 8334826 499895.932267\n", "B 8328359 500029.499730\n", "C 8334322 499964.224303\n", "D 8329469 500127.047667\n", "E 8334574 499913.274837\n", "F 8338450 500146.175426" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time \n", "df.groupby('category').agg({'val':['count','mean']})" ] }, { "cell_type": "markdown", "id": "8985edde-358e-4c1e-9ef4-6da5e1ac2957", "metadata": {}, "source": [ "#### join 操作对比 (duckdb比pandas快170倍)" ] }, { "cell_type": "code", "execution_count": 12, "id": "5c34c594-c45d-4ca7-b5e4-934972fdad97", "metadata": {}, "outputs": [], "source": [ "dfcolor = pd.DataFrame(\n", " {\n", " 'cat': ['A','B','C','D','E','F'],\n", " 'color':['black','yellow','pink','blue','white','green'] \n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "id": "695bc432-334c-4ec7-a769-37ce7b33dff1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.16 ms, sys: 17.7 ms, total: 21.9 ms\n", "Wall time: 35.4 ms\n" ] }, { "data": { "text/plain": [ "┌──────────┬────────────┬─────────┐\n", "│ category │ val │ color │\n", "│ varchar │ double │ varchar │\n", "├──────────┼────────────┼─────────┤\n", "│ D │ 869039.467 │ blue │\n", "│ B │ 353716.066 │ yellow │\n", "│ C │ 180675.829 │ pink │\n", "│ D │ 964817.406 │ blue │\n", "│ E │ 756984.249 │ white │\n", "│ C │ 528946.308 │ pink │\n", "│ C │ 513503.377 │ pink │\n", "│ F │ 834033.897 │ green │\n", "│ E │ 919872.73 │ white │\n", "│ A │ 699220.429 │ black │\n", "│ · │ · │ · │\n", "│ · │ · │ · │\n", "│ · │ · │ · │\n", "│ A │ 901283.968 │ black │\n", "│ C │ 942697.15 │ pink │\n", "│ A │ 780004.71 │ black │\n", "│ E │ 207837.189 │ white │\n", "│ D │ 403110.826 │ blue │\n", "│ A │ 672348.467 │ black │\n", "│ D │ 877300.687 │ blue │\n", "│ E │ 337017.408 │ white │\n", "│ E │ 762162.943 │ white │\n", "│ D │ 297255.887 │ blue │\n", "├──────────┴────────────┴─────────┤\n", "│ ? rows (>9999 rows, 20 shown) │\n", "└─────────────────────────────────┘" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "tb_color = duckdb.from_df(dfcolor)\n", "duckdb.sql('select tb.category, tb.val, tb_color.color from tb join tb_color on tb.category = tb_color.cat')" ] }, { "cell_type": "code", "execution_count": 15, "id": "03e0c0f9-7d1e-488f-a205-e1a9bdbed813", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.23 s, sys: 851 ms, total: 6.08 s\n", "Wall time: 6.07 s\n" ] } ], "source": [ "%%time\n", "dfjoin = df.merge(dfcolor,left_on='category',right_on='cat')" ] }, { "cell_type": "code", "execution_count": null, "id": "2bfe4248-7135-4b77-8e51-3cbf0b991642", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5b1ca0e8-3c08-4c61-87fa-da56e908876b", "metadata": {}, "source": [ "### 2,自定义函数UDF" ] }, { "cell_type": "markdown", "id": "18825295-9263-414c-a198-07daf7b4bb08", "metadata": {}, "source": [ "pyspark的一个优点是可以在sql中使用注册自定义函数(UDF),比较灵活。\n", "\n", "那么duckdb支持在sql中使用注册自定义函数吗? of course!\n", "\n", "* 注册方法:使用create_function方法来注册一个Python函数作为UDF。需要提供函数名称、Python函数、参数类型和返回类型。\n", "\n", "* 类型注解:如果Python函数有类型注解,可以省略parameters和return_type参数,DuckDB会根据注解自动推断。\n", "\n", "* 空值处理:默认情况下,当UDF接收到NULL值时,会立即返回NULL。如果需要特殊处理,可以设置null_handling=\"special\"。\n", "\n", "* 异常处理:默认情况下,如果Python函数抛出异常,DuckDB会重新抛出该异常。如果希望改为返回null,可以设置exception_handling=\"return_null\"。\n", "\n", "* 副作用:如果UDF的结果受随机性影响,需要将side_effects设置为True。\n", "\n", "* 使用Arrow:如果函数需要接收Arrow数组,设置type='arrow'。这会通知系统提供Arrow数组给函数,并期望函数返回相同数量的数组。\n", "\n", "* 使用Native:当设置type='native'时,函数将按单个元组接收数据,并返回单个值。这适用于与不操作Arrow的Python库交互,如faker。\n", "\n" ] }, { "cell_type": "code", "execution_count": 130, "id": "22504a63-5653-40ff-a1cc-529b5207fb40", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
function_namefunction_typedescriptionreturn_typeparametersparameter_typesexample
0randomscalarReturns a random number between 0 and 1DOUBLE[][]random()
1random_namescalarNoneVARCHAR[col0][BIGINT]None
2gen_random_uuidscalarReturns a random UUID similar to this: eeccb8c...UUID[][]uuid()
\n", "
" ], "text/plain": [ " function_name function_type \\\n", "0 random scalar \n", "1 random_name scalar \n", "2 gen_random_uuid scalar \n", "\n", " description return_type parameters \\\n", "0 Returns a random number between 0 and 1 DOUBLE [] \n", "1 None VARCHAR [col0] \n", "2 Returns a random UUID similar to this: eeccb8c... UUID [] \n", "\n", " parameter_types example \n", "0 [] random() \n", "1 [BIGINT] None \n", "2 [] uuid() " ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#先查看当前duckdb函数中带有 rand的函数\n", "def get_rand_funs():\n", " query = \"\"\"\n", " SELECT function_name,function_type,description,\n", " return_type,parameters,parameter_types,\n", " example \n", " FROM duckdb_functions() \n", " where function_name like '%rand%' \n", " \"\"\"\n", " df_funs = duckdb.sql(query).to_df()\n", " return df_funs\n", "\n", "df_funs = get_rand_funs()\n", "df_funs" ] }, { "cell_type": "code", "execution_count": 136, "id": "0eb78984-a133-4e1f-98de-221ad320c4f1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import duckdb\n", "from faker import Faker\n", "def generate_random_name(i:int) -> str: \n", " fake = Faker()\n", " fake.random.seed(i)\n", " \n", " #演示异常逻辑处理\n", " #if i%10==0:\n", " # raise Exception('error') \n", " \n", " name = fake.name()\n", " return name\n", "\n", "#移除UDF\n", "if 'random_name' in get_rand_funs()['function_name'].tolist():\n", " duckdb.remove_function('random_name') \n", "\n", " \n", "#注册UDF\n", "duckdb.create_function(\"random_name\", generate_random_name, \n", " [duckdb.typing.BIGINT], duckdb.typing.VARCHAR, \n", " exception_handling=\"return_null\")\n" ] }, { "cell_type": "code", "execution_count": 138, "id": "2eca208c-0800-441c-b02c-23b89d7fcb75", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "57839a472b354a77b254e0372e697757", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "┌─────────────────────┬───────┐\n", "│ name │ score │\n", "│ varchar │ int32 │\n", "├─────────────────────┼───────┤\n", "│ Norma Fisher │ 54 │\n", "│ Ryan Gallagher │ 8 │\n", "│ Theresa Brown │ 37 │\n", "│ Joshua Wood │ 58 │\n", "│ Brian Foster │ 93 │\n", "│ Natalie Pope │ 62 │\n", "│ Samantha Washington │ 49 │\n", "│ Chris Curtis │ 85 │\n", "│ Victor Martinez │ 79 │\n", "│ James Taylor │ 68 │\n", "│ · │ · │\n", "│ · │ · │\n", "│ · │ · │\n", "│ Brandon Keller │ 9 │\n", "│ Christopher Sutton │ 67 │\n", "│ Karl Grant │ 53 │\n", "│ Kathleen Knight │ 23 │\n", "│ Becky Ramirez │ 84 │\n", "│ Kayla Osborn │ 35 │\n", "│ Terry Foster │ 60 │\n", "│ Robert Watson │ 48 │\n", "│ Ronald Gomez │ 85 │\n", "│ Curtis White │ 55 │\n", "├─────────────────────┴───────┤\n", "│ 100 rows (20 shown) │\n", "└─────────────────────────────┘" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tb_student = duckdb.sql(\"SELECT random_name(i) as name, cast(100*random() as INT) as score from range(100) tbl(i)\")\n", "tb_student " ] }, { "cell_type": "code", "execution_count": 139, "id": "d17c4177-b6f0-4752-9306-9e6d5ccd9c64", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5f664ffcbe1e4dc0ba0c70de32c60792", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namescore
0Norma Fisher19
1Ryan Gallagher40
2Theresa Brown36
3Joshua Wood50
4Brian Foster35
5Natalie Pope50
6Samantha Washington28
7Chris Curtis77
8Victor Martinez8
9James Taylor21
10Jessica Lane4
11Justin Glass9
12Matthew Estrada81
13Michael Sutton81
14Michael Pearson87
\n", "
" ], "text/plain": [ " name score\n", "0 Norma Fisher 19\n", "1 Ryan Gallagher 40\n", "2 Theresa Brown 36\n", "3 Joshua Wood 50\n", "4 Brian Foster 35\n", "5 Natalie Pope 50\n", "6 Samantha Washington 28\n", "7 Chris Curtis 77\n", "8 Victor Martinez 8\n", "9 James Taylor 21\n", "10 Jessica Lane 4\n", "11 Justin Glass 9\n", "12 Matthew Estrada 81\n", "13 Michael Sutton 81\n", "14 Michael Pearson 87" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.options.display.max_rows=1000\n", "tb_student.to_df().head(15)" ] }, { "cell_type": "code", "execution_count": 142, "id": "da358629-d2dd-4c53-aef2-d3ee3314655c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
function_namefunction_typedescriptionreturn_typeparametersparameter_typesexample
0randomscalarReturns a random number between 0 and 1DOUBLE[][]random()
1random_namescalarNoneVARCHAR[col0][BIGINT]None
2gen_random_uuidscalarReturns a random UUID similar to this: eeccb8c...UUID[][]uuid()
\n", "
" ], "text/plain": [ " function_name function_type \\\n", "0 random scalar \n", "1 random_name scalar \n", "2 gen_random_uuid scalar \n", "\n", " description return_type parameters \\\n", "0 Returns a random number between 0 and 1 DOUBLE [] \n", "1 None VARCHAR [col0] \n", "2 Returns a random UUID similar to this: eeccb8c... UUID [] \n", "\n", " parameter_types example \n", "0 [] random() \n", "1 [BIGINT] None \n", "2 [] uuid() " ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#再看看当前duckdb函数中带有 rand的函数,可以看到包括我们自定义的random_name了\n", "get_rand_funs()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "402df2c3-023d-4815-b7ce-d45761ab9a33", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e91e4fc6-aa99-4682-94f2-4f2b44debebc", "metadata": {}, "source": [ "## 三,输出落表" ] }, { "cell_type": "code", "execution_count": 144, "id": "b9df6da4-b1ef-47b5-91c2-85f7da688701", "metadata": {}, "outputs": [], "source": [ "tb_student.to_parquet('student.parquet')" ] }, { "cell_type": "markdown", "id": "39c650d5-dd29-4388-921d-f76dc9cb8115", "metadata": {}, "source": [ "## " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }