
What is ClickHouse? ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP).

In a traditional row-oriented DBMS, data is stored in this order:

    Row WatchID JavaEnable Title GoodEvent EventTime

    #0 89354350662 1 Investor Relations 1 2016-05-18 05:19:20

    #1 90329509958 0 Contact us 1 2016-05-18 08:10:20

    #2 89953706054 1 Mission 1 2016-05-18 07:38:00

    #N … … … … …

Data belonging to the same row is always physically stored together.

Examples of row-oriented DBMSs: MySQL, Postgres, and MS SQL Server.

In a column-oriented DBMS, data is stored in this order:

    Row: #0 #1 #2 #N

    WatchID: 89354350662 90329509958 89953706054 …

    JavaEnable: 1 0 1 …

    Title: Investor Relations Contact us Mission …

    GoodEvent: 1 1 1 …

    EventTime: 2016-05-18 05:19:20 2016-05-18 08:10:20 2016-05-18 07:38:00 …

These examples only show the order in which data is arranged. Values from different columns are stored separately, and data from the same column is stored together.

Examples of column-oriented DBMSs: Vertica, Paraccel (Actian Matrix, Amazon Redshift), Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB (VectorWise, Actian Vector), LucidDB, SAP HANA, Google Dremel, Google PowerDrill, Druid, and kdb+.

Different orders for storing data suit different scenarios. The data access scenario includes: which queries are run, how often, and in what proportion; how much data is read for each type of query (rows, columns, and bytes); the relationship between reading and updating data; the size of the working dataset and how locally it is used; whether transactions are used and how they are isolated; requirements for data replication and integrity; and the latency and throughput required for each type of query, and so on.

The higher the load on the system, the more important it is to customize the setup to the usage scenario, and the more fine-grained this customization becomes. No single system is equally well suited to significantly different scenarios. If a system aims to fit a wide range of scenarios, then under high load, trying to accommodate all of them forces a choice: balance or efficiency?

Key properties of the OLAP scenario:

The vast majority of requests are reads.

Data is updated in fairly large batches (> 1000 rows) rather than single rows, or not updated at all. Data that has been added to the database cannot be modified.

For reads, quite a large number of rows is extracted from the database, but only a small subset of the columns.

Tables are "wide", meaning each table contains a large number of columns.

Queries are relatively rare (usually hundreds of queries per server per second, or fewer).

For simple queries, latencies of around 50 ms are allowed.

Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).

High throughput is required when processing a single query (up to billions of rows per second per server).

Transactions are not necessary.

Requirements for data consistency are low.

Each query involves one large table; all the other tables are small.

The query result is significantly smaller than the source data. In other words, the data is filtered or aggregated, so the result fits in a single server's RAM.

It is easy to see that the OLAP scenario is very different from other common business scenarios (such as OLTP or key-value access), so it doesn't make sense to try to use an OLTP or key-value database for processing analytical queries and expect great results. For example, an OLAP database will usually handle analytical requests far better than MongoDB or Redis would.

Why column-oriented databases work better in the OLAP scenario: column-oriented databases are better suited to OLAP scenarios (they are at least 100 times faster in processing most queries). The reasons are explained in detail below (pictures make this easier to grasp intuitively):

[Figure: row-oriented DBMS]

[Figure: column-oriented DBMS]

See the difference? The rest of this section explains in detail why this happens.

Input/output: 1. For an analytical query, only a small number of table columns need to be read. In a column-oriented database you can read just the data you need. For example, if you need 5 columns out of 100, you can expect at least a 20-fold reduction in I/O.

2. Since data is always read in packed batches, it is easy to compress, and data stored column by column compresses even better. This further reduces the I/O volume. 3. Due to the reduced I/O, more data fits in the system cache.

For example, the query "count the number of records for each advertising platform" requires reading one "advertising platform ID" column, which takes up 1 byte uncompressed. If most of the traffic did not come from advertising platforms, you can expect at least 10-fold compression of this column. With a fast compression algorithm, data can be decompressed at a rate of at least several gigabytes of uncompressed data per second. In other words, this query can be processed at a speed of roughly several billion rows per second on a single server. This speed is actually achieved in practice.
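As a rough sketch of the query described above, assuming the hits table and its AdvEngineID column from the tutorial dataset later in this document stand in for "advertising platform ID":

SELECT AdvEngineID, count() AS records
FROM hits
GROUP BY AdvEngineID
ORDER BY records DESC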

CPU: Since executing a query requires processing a large number of rows, it is far more efficient to dispatch every operation for an entire vector of values rather than for each row individually. This also makes it possible to implement a query engine with almost no dispatch cost. If you don't do this, with any mechanical hard drive the query engine inevitably stalls the CPU while waiting. So it makes sense both to store data in columns and to process it by columns.

There are two ways to do this:

1. A vector engine: all operations are written for vectors instead of for individual values. This means operations do not need to be dispatched frequently, and dispatch costs become negligible. Operation code contains an optimized internal loop.

2. Code generation: generate a piece of code that contains all the operations in the query.

This is not something a general-purpose database should implement, because it makes no sense when running simple queries. There are exceptions, however: for example, MemSQL uses code generation to reduce latency when processing SQL queries (for comparison, analytical DBMSs typically need to optimize for throughput, not latency).

Note that for CPU efficiency, the query language must be declarative (SQL or MDX), or at least vector-based (J, K). The query should contain only implicit loops, allowing for optimization.
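As a minimal illustration of "only implicit loops", the aggregation below (over a hypothetical table t with a numeric column x) scans every row without any loop appearing in the query text, which leaves the engine free to vectorize the work:

SELECT sum(x) FROM t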

Original article

Getting started: If you are new to ClickHouse and want to get a hands-on feel for its performance:

First, go through the installation and deployment process.

After that, you can take your first steps through the tutorial and the example datasets:

QuickStart tutorial — a quick walkthrough of basic ClickHouse operations.
Example dataset: flight data — sample data with common SQL query scenarios.

Original article

Installation. System requirements: ClickHouse can run on any Linux, FreeBSD, or Mac OS X with an x86_64, AArch64, or PowerPC64LE CPU architecture.

Official pre-built binaries are typically compiled for x86_64 and make use of the SSE 4.2 instruction set, so unless otherwise stated, a CPU that supports it is an additional system requirement. Here's the command to check whether the current CPU supports SSE 4.2:

    $ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

To run ClickHouse on processors that do not support SSE 4.2, or on the AArch64 or PowerPC64LE architectures, you should build ClickHouse from source with appropriate configuration adjustments.

Available installation packages


DEB packages: It is recommended to use the official pre-compiled deb packages for Debian or Ubuntu. Run these commands to install the packages:

sudo apt-get install apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4

echo "deb https://repo.clickhouse.tech/deb/stable/ main/" | sudo tee \
    /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update

sudo apt-get install -y clickhouse-server clickhouse-client

sudo service clickhouse-server start
clickhouse-client

If you want to use the most recent version, replace stable with testing (this is recommended for test environments only).
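For example (a sketch derived simply by substituting testing for stable in the line above), the apt source for the testing channel would be:

deb https://repo.clickhouse.tech/deb/testing/ main/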

You can also download and install the packages manually from here: download.

Package list:

clickhouse-common-static — the compiled ClickHouse binaries.
clickhouse-server — creates the clickhouse-server symlink and installs the default server configuration.
clickhouse-client — creates the clickhouse-client symlink and other client-related tools, and installs the client configuration files.
clickhouse-common-static-dbg — ClickHouse binaries with debug info.

RPM packages: It is recommended to use the official pre-compiled rpm packages for CentOS, RedHat, and all other rpm-based Linux distributions.

First, you need to add the official repository:

sudo yum install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/stable/x86_64

If you want to use the most recent version, replace stable with testing (recommended for test environments only). prestable is also sometimes available.

Then run these commands to install the packages:

    sudo yum install clickhouse-server clickhouse-client

You can also download and install the packages manually from here: download.

Tgz packages: It is recommended to use the official pre-compiled tgz archives if your operating system does not support installing deb or rpm packages.

The required version can be downloaded with curl or wget from the repository https://repo.clickhouse.tech/tgz/.

After downloading, unpack the archives and run the installation scripts. Here's an example of installing the latest version:

export LATEST_VERSION=`curl https://api.github.com/repos/ClickHouse/ClickHouse/tags 2>/dev/null | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -n 1`
curl -O https://repo.clickhouse.tech/tgz/clickhouse-common-static-$LATEST_VERSION.tgz
curl -O https://repo.clickhouse.tech/tgz/clickhouse-common-static-dbg-$LATEST_VERSION.tgz
curl -O https://repo.clickhouse.tech/tgz/clickhouse-server-$LATEST_VERSION.tgz
curl -O https://repo.clickhouse.tech/tgz/clickhouse-client-$LATEST_VERSION.tgz

tar -xzvf clickhouse-common-static-$LATEST_VERSION.tgz
sudo clickhouse-common-static-$LATEST_VERSION/install/doinst.sh

tar -xzvf clickhouse-common-static-dbg-$LATEST_VERSION.tgz
sudo clickhouse-common-static-dbg-$LATEST_VERSION/install/doinst.sh

tar -xzvf clickhouse-server-$LATEST_VERSION.tgz
sudo clickhouse-server-$LATEST_VERSION/install/doinst.sh
sudo /etc/init.d/clickhouse-server start

tar -xzvf clickhouse-client-$LATEST_VERSION.tgz
sudo clickhouse-client-$LATEST_VERSION/install/doinst.sh

For production environments, it is recommended to use the latest stable version. You can find it on the GitHub page https://github.com/ClickHouse/ClickHouse/tags; it is tagged with the suffix -stable.

Docker images: To run ClickHouse inside Docker, follow the guide on Docker Hub. The images use the official deb packages internally.

Other environments: For non-Linux operating systems and the AArch64 CPU architecture, ClickHouse is built from the latest commit of the master branch (with a delay of a few hours).

macOS — curl -O 'https://builds.clickhouse.tech/master/macos/clickhouse' && chmod a+x ./clickhouse
FreeBSD — curl -O 'https://builds.clickhouse.tech/master/freebsd/clickhouse' && chmod a+x ./clickhouse
AArch64 — curl -O 'https://builds.clickhouse.tech/master/aarch64/clickhouse' && chmod a+x ./clickhouse

After downloading, you can use clickhouse client to connect to a server, or clickhouse local to process local data; however, you must additionally download the server and users configuration files from GitHub.
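As a sketch of how the single downloaded binary is invoked through subcommands (no package installation assumed):

./clickhouse client --host=localhost
./clickhouse local --query "SELECT 1"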

These builds are not recommended for use in production environments because they are less thoroughly tested, and they also contain only a subset of ClickHouse features; use them at your own risk.

Installing from source: To compile ClickHouse manually, follow the instructions for Linux or Mac OS X.


You can compile and install the packages, or use the programs without installing packages. With a manual build you can also disable the SSE 4.2 requirement or build for AArch64 CPUs.

Client: programs/clickhouse-client
Server: programs/clickhouse-server

You'll need to create data and metadata folders and chown them for the desired user. Their paths can be changed in the server config (src/programs/server/config.xml); by default they are:

/opt/clickhouse/data/default/
/opt/clickhouse/metadata/default/

On Gentoo, you can use emerge clickhouse to install ClickHouse from source.

Launch: If you don't have the service command, run the following command to start the server in the background:

    $ sudo /etc/init.d/clickhouse-server start

Log files are written to the /var/log/clickhouse-server/ directory.

If the server doesn't start, check the configuration in /etc/clickhouse-server/config.xml.

You can also start the server from the console manually:

    $ clickhouse-server --config-file=/etc/clickhouse-server/config.xml

In this case, the log is printed to the console, which is convenient during development.

If the configuration file is in the current directory, you don't need to specify the --config-file parameter; by default, its path is ./config.xml.

ClickHouse supports access restriction settings. They are located in the users.xml file (next to config.xml). By default, access is allowed from anywhere for the default user, without a password; see user/default/networks. For more information, see the Configuration Files section.
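A hedged sketch of the relevant fragment of users.xml (element names follow the users.xml shipped with ClickHouse at the time; ::/0 means "any address"):

<users>
    <default>
        <password></password>
        <networks>
            <ip>::/0</ip>
        </networks>
    </default>
</users>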

After launching the server, you can use the command-line client to connect to it:

    $ clickhouse-client

By default, it connects to localhost:9000 as the default user without a password. You can also use the --host argument to connect to a specified server.
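For example, to connect to a remote server (the hostname below is a placeholder):

clickhouse-client --host=ch1.example.com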

The terminal must use UTF-8 encoding. For more information, see the section Command-line client.

Example:

$ ./clickhouse-client
ClickHouse client version 0.0.18749.
Connecting to localhost:9000.
Connected to ClickHouse server version 0.0.18749.

    :) SELECT 1

    SELECT 1

┌─1─┐
│ 1 │
└───┘

    1 rows in set. Elapsed: 0.003 sec.

    :)

Congratulations, the system works!

To continue experimenting, you can try downloading one of the test datasets or go through the tutorial.

Original article

ClickHouse tutorial. What to expect from this tutorial? By going through this tutorial, you'll learn how to set up a simple ClickHouse cluster. It'll be small, but fault-tolerant and scalable. Then we will use one of the example datasets to fill it with data and execute some demo queries.

Single-node setup: To postpone the complexities of a distributed environment, we'll start by deploying ClickHouse on a single server or virtual machine. ClickHouse is usually installed from deb or rpm packages, but there are alternatives for operating systems that don't support them.

For example, if you've chosen the deb packages, run:


sudo apt-get install apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4

echo "deb https://repo.clickhouse.tech/deb/stable/ main/" | sudo tee \
    /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update

sudo apt-get install -y clickhouse-server clickhouse-client

sudo service clickhouse-server start
clickhouse-client

The installed software includes these packages:

clickhouse-client — contains the clickhouse-client application, an interactive ClickHouse console client.
clickhouse-common — contains a ClickHouse executable file.
clickhouse-server — contains the configuration files to run ClickHouse as a server.

Server config files are located in /etc/clickhouse-server/. Before going further, please notice the <path> element in config.xml. It determines where the data is stored, so it should be located on a volume with large disk capacity; the default value is /var/lib/clickhouse/. If you want to adjust the configuration, it's not convenient to edit config.xml directly, considering it might get rewritten on future package updates. The recommended way to override config elements is to create files in the config.d directory, which serve as patches to config.xml.
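For instance, a sketch of such an override file (the file name data-path.xml is arbitrary; the <path> element and <yandex> root tag match the config.xml layout of that era) that moves data storage to a larger volume:

<!-- /etc/clickhouse-server/config.d/data-path.xml -->
<yandex>
    <path>/data/clickhouse/</path>
</yandex>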

As you might have noticed, clickhouse-server is not launched automatically after package installation. It won't be automatically restarted after updates, either. The way you start the server depends on your init system; usually, it is:

    sudo service clickhouse-server start

    sudo /etc/init.d/clickhouse-server start

The default location for server logs is /var/log/clickhouse-server/. The server is ready to handle client connections once it logs the Ready for connections message.

Once clickhouse-server is up and running, we can use clickhouse-client to connect to the server and run some test queries like SELECT "Hello, world!";.

Quick tips for clickhouse-client

Import the sample dataset: Now it's time to fill our ClickHouse server with some sample data. In this tutorial, we'll use anonymized data from Yandex.Metrica, the first service to run ClickHouse in production before it became open source (more on that in the ClickHouse history section). There are multiple ways to import the Yandex.Metrica dataset, and for the sake of the tutorial, we'll go with the most realistic one.

Download and extract the table data:

curl https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
curl https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv

The extracted files are about 10 GB in size.

Create tables: As in most database management systems, ClickHouse logically groups tables into databases. There's a default database, but we'll create a new one named tutorial:

    clickhouse-client --query "CREATE DATABASE IF NOT EXISTS tutorial"

The syntax for creating tables is far more complicated than for creating databases (see the reference). In general, a CREATE TABLE statement has to specify three key things:

1. The name of the table to create.
2. The table schema, i.e. the list of columns and their data types.
3. The table engine and its settings, which determine all the details of how queries against this table will be physically executed.

Yandex.Metrica is a web analytics service, and the sample dataset doesn't cover its full functionality, so there are only two tables to create:

hits — a table with every action performed by all users on all websites covered by the service.
visits — a table that contains pre-built sessions instead of individual actions.

Let's see and execute the real CREATE TABLE queries for these tables:

CREATE TABLE tutorial.hits_v1
(
    `WatchID` UInt64, `JavaEnable` UInt8, `Title` String, `GoodEvent` Int16, `EventTime` DateTime, `EventDate` Date,
    `CounterID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RegionID` UInt32, `UserID` UInt64, `CounterClass` Int8,
    `OS` UInt8, `UserAgent` UInt8, `URL` String, `Referer` String, `URLDomain` String, `RefererDomain` String, `Refresh` UInt8, `IsRobot` UInt8,
    `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32),
    `ResolutionWidth` UInt16, `ResolutionHeight` UInt16, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8, `FlashMinor2` String,
    `NetMajor` UInt8, `NetMinor` UInt8, `UserAgentMajor` UInt16, `UserAgentMinor` FixedString(2), `CookieEnable` UInt8, `JavascriptEnable` UInt8,
    `IsMobile` UInt8, `MobilePhone` UInt8, `MobilePhoneModel` String, `Params` String, `IPNetworkID` UInt32, `TraficSourceID` Int8,
    `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `IsArtifical` UInt8, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16,
    `ClientTimeZone` Int16, `ClientEventTime` DateTime, `SilverlightVersion1` UInt8, `SilverlightVersion2` UInt8, `SilverlightVersion3` UInt32, `SilverlightVersion4` UInt16,
    `PageCharset` String, `CodeVersion` UInt32, `IsLink` UInt8, `IsDownload` UInt8, `IsNotBounce` UInt8, `FUniqID` UInt64, `HID` UInt32,
    `IsOldCounter` UInt8, `IsEvent` UInt8, `IsParameter` UInt8, `DontCountHits` UInt8, `WithHash` UInt8, `HitColor` FixedString(1),
    `UTCEventTime` DateTime, `Age` UInt8, `Sex` UInt8, `Income` UInt8, `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16),
    `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `WindowName` Int32, `OpenerName` Int32, `HistoryLength` Int16,
    `BrowserLanguage` FixedString(2), `BrowserCountry` FixedString(2), `SocialNetwork` String, `SocialAction` String, `HTTPError` UInt16,
    `SendTiming` Int32, `DNSTiming` Int32, `ConnectTiming` Int32, `ResponseStartTiming` Int32, `ResponseEndTiming` Int32, `FetchTiming` Int32,
    `RedirectTiming` Int32, `DOMInteractiveTiming` Int32, `DOMContentLoadedTiming` Int32, `DOMCompleteTiming` Int32,
    `LoadEventStartTiming` Int32, `LoadEventEndTiming` Int32, `NSToDOMContentLoadedTiming` Int32, `FirstPaintTiming` Int32, `RedirectCount` Int8,
    `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `ParamPrice` Int64, `ParamOrderID` String, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16,
    `GoalsReached` Array(UInt32), `OpenstatServiceName` String, `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String,
    `UTMSource` String, `UTMMedium` String, `UTMCampaign` String, `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8,
    `RefererHash` UInt64, `URLHash` UInt64, `CLID` UInt32, `YCLID` UInt64, `ShareService` String, `ShareURL` String, `ShareTitle` String,
    `ParsedParams` Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64),
    `IslandID` FixedString(16), `RequestNum` UInt32, `RequestTry` UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)

CREATE TABLE tutorial.visits_v1
(
    `CounterID` UInt32, `StartDate` Date, `Sign` Int8, `IsNew` UInt8, `VisitID` UInt64, `UserID` UInt64,
    `StartTime` DateTime, `Duration` UInt32, `UTCStartTime` DateTime, `PageViews` Int32, `Hits` Int32, `IsBounce` UInt8,
    `Referer` String, `StartURL` String, `RefererDomain` String, `StartURLDomain` String, `EndURL` String, `LinkURL` String,
    `IsDownload` UInt8, `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `PlaceID` Int32,
    `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32),
    `IsYandex` UInt8, `GoalReachesDepth` Int32, `GoalReachesURL` Int32, `GoalReachesAny` Int32,
    `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `MobilePhoneModel` String, `ClientEventTime` DateTime,
    `RegionID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `IPNetworkID` UInt32,
    `SilverlightVersion3` UInt32, `CodeVersion` UInt32, `ResolutionWidth` UInt16, `ResolutionHeight` UInt16,
    `UserAgentMajor` UInt16, `UserAgentMinor` UInt16, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16,
    `SilverlightVersion2` UInt8, `SilverlightVersion4` UInt16, `FlashVersion3` UInt16, `FlashVersion4` UInt16,
    `ClientTimeZone` Int16, `OS` UInt8, `UserAgent` UInt8, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8,
    `NetMajor` UInt8, `NetMinor` UInt8, `MobilePhone` UInt8, `SilverlightVersion1` UInt8, `Age` UInt8, `Sex` UInt8, `Income` UInt8,
    `JavaEnable` UInt8, `CookieEnable` UInt8, `JavascriptEnable` UInt8, `IsMobile` UInt8, `BrowserLanguage` UInt16, `BrowserCountry` UInt16,
    `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16), `Params` Array(String),
    `Goals` Nested(ID UInt32, Serial UInt32, EventTime DateTime, Price Int64, OrderID String, CurrencyID UInt32),
    `WatchIDs` Array(UInt64), `ParamSumPrice` Int64, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16,
    `ClickLogID` UInt64, `ClickEventID` Int32, `ClickGoodEvent` Int32, `ClickEventTime` DateTime, `ClickPriorityID` Int32,
    `ClickPhraseID` Int32, `ClickPageID` Int32, `ClickPlaceID` Int32, `ClickTypeID` Int32, `ClickResourceID` Int32,
    `ClickCost` UInt32, `ClickClientIP` UInt32, `ClickDomainID` UInt32, `ClickURL` String, `ClickAttempt` UInt8,
    `ClickOrderID` UInt32, `ClickBannerID` UInt32, `ClickMarketCategoryID` UInt32, `ClickMarketPP` UInt32,
    `ClickMarketCategoryName` String, `ClickMarketPPName` String, `ClickAWAPSCampaignName` String, `ClickPageName` String,
    `ClickTargetType` UInt16, `ClickTargetPhraseID` UInt64, `ClickContextType` UInt8, `ClickSelectType` Int8,
    `ClickOptions` String, `ClickGroupBannerID` Int32, `OpenstatServiceName` String, `OpenstatCampaignID` String,
    `OpenstatAdID` String, `OpenstatSourceID` String, `UTMSource` String, `UTMMedium` String, `UTMCampaign` String,
    `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8, `FirstVisit` DateTime, `PredLastVisit` Date,
    `LastVisit` Date, `TotalVisits` UInt32,
    `TraficSource` Nested(ID Int8, SearchEngineID UInt16, AdvEngineID UInt8, PlaceID UInt16, SocialSourceNetworkID UInt8, Domain String, SearchPhrase String, SocialSourcePage String),
    `Attendance` FixedString(16), `CLID` UInt32, `YCLID` UInt64, `NormalizedRefererHash` UInt64, `SearchPhraseHash` UInt64,
    `RefererDomainHash` UInt64, `NormalizedStartURLHash` UInt64, `StartURLDomainHash` UInt64, `NormalizedEndURLHash` UInt64,
    `TopLevelDomain` UInt64, `URLScheme` UInt64, `OpenstatServiceNameHash` UInt64, `OpenstatCampaignIDHash` UInt64,
    `OpenstatAdIDHash` UInt64, `OpenstatSourceIDHash` UInt64, `UTMSourceHash` UInt64, `UTMMediumHash` UInt64,
    `UTMCampaignHash` UInt64, `UTMContentHash` UInt64, `UTMTermHash` UInt64, `FromHash` UInt64,
    `WebVisorEnabled` UInt8, `WebVisorActivity` UInt32,
    `ParsedParams` Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64),
    `Market` Nested(Type UInt8, GoalID UInt32, OrderID String, OrderPrice Int64, PP UInt32, DirectPlaceID UInt32, DirectOrderID UInt32, DirectBannerID UInt32, GoodID String, GoodName String, GoodQuantity Int32, GoodPrice Int64),
    `IslandID` FixedString(16)
)
ENGINE = CollapsingMergeTree(Sign)
PARTITION BY toYYYYMM(StartDate)
ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)
SAMPLE BY intHash32(UserID)

You can execute those queries using the interactive mode of clickhouse-client (just launch it in a terminal without specifying a query in advance), or try some alternative interface if you want.

As we can see, hits_v1 uses the basic MergeTree engine, while visits_v1 uses the CollapsingMergeTree variant.

Import data: Data is imported into ClickHouse with an INSERT INTO query, as in many other SQL databases. However, the data is usually provided in one of the supported serialization formats instead of a VALUES clause (which is also supported).

The files we downloaded earlier are in tab-separated format, so here's how to import them via the console client:

  • clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsvclickhouse-client --query "INSERT INTO tutorial.visits_v1 FORMAT TSV" --max_insert_block_size=100000 < visits_v1.tsv

ClickHouse has a lot of settings to tune. One way to specify them in the console client is via arguments, as we saw above with --max_insert_block_size. The easiest way to figure out which settings are available, what they mean, and what their defaults are is to query the system.settings table:

SELECT name, value, changed, description
FROM system.settings
WHERE name LIKE '%max_insert_b%'
FORMAT TSV

    max_insert_block_size 1048576 0 "The maximum block size for insertion, if we control the creation of blocks for insertion."

Optionally, you can OPTIMIZE the tables after import. Tables configured with an engine from the MergeTree family always merge data parts in the background to optimize data storage (or at least check whether it makes sense). These queries force the table engine to do the storage optimization right now instead of some time later:

    clickhouse-client --query "OPTIMIZE TABLE tutorial.hits_v1 FINAL"clickhouse-client --query "OPTIMIZE TABLE tutorial.visits_v1 FINAL"

These queries start I/O- and CPU-intensive operations, so if the table keeps receiving new data, it's better to leave it alone and let merges run in the background.

Now we can check whether the table import was successful:

    clickhouse-client --query "SELECT COUNT(*) FROM tutorial.hits_v1"clickhouse-client --query "SELECT COUNT(*) FROM tutorial.visits_v1"

Example queries

SELECT StartURL AS URL, AVG(Duration) AS AvgDuration
FROM tutorial.visits_v1
WHERE StartDate BETWEEN '2014-03-23' AND '2014-03-30'
GROUP BY URL
ORDER BY AvgDuration DESC
LIMIT 10

SELECT sum(Sign) AS visits, sumIf(Sign, has(Goals.ID, 1105530)) AS goal_visits, (100. * goal_visits) / visits AS goal_percent
FROM tutorial.visits_v1
WHERE (CounterID = 912887) AND (toYYYYMM(StartDate) = 201403) AND (domain(StartURL) = 'yandex.ru')

Cluster deployment: A ClickHouse cluster is a homogeneous cluster. The steps to set one up:

1. Install the ClickHouse server on all machines of the cluster.
2. Set up cluster configs in the configuration files.
3. Create local tables on each instance.
4. Create a distributed table.

A distributed table is actually a kind of view that maps onto the local tables of a ClickHouse cluster. A SELECT query against a distributed table executes using the resources of all the cluster's shards. You may specify configs for multiple clusters and create multiple distributed tables providing views for different clusters.

Example config for a cluster with three shards, one replica each:

<remote_servers>
    <perftest_3shards_1replicas>
        <shard>
            <replica>
                <host>example-perftest01j.yandex.ru</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>example-perftest02j.yandex.ru</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>example-perftest03j.yandex.ru</host>
                <port>9000</port>
            </replica>
        </shard>
    </perftest_3shards_1replicas>
</remote_servers>

For further demonstration, let's create a new local table with the same CREATE TABLE query that we used for hits_v1, but with a different table name:

    CREATE TABLE tutorial.hits_local (...) ENGINE = MergeTree() ...

Create a distributed table that provides a view into the local tables of the cluster:

CREATE TABLE tutorial.hits_all AS tutorial.hits_local
ENGINE = Distributed(perftest_3shards_1replicas, tutorial, hits_local, rand());

A common practice is to create similar distributed tables on all machines of the cluster. This allows running distributed queries on any machine of the cluster. There's also an alternative option to create a temporary distributed table for a given SELECT query using the remote table function.
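As a hedged sketch of that alternative, the remote table function queries a table on another server ad hoc, without declaring a Distributed table first (hostname reused from the sample config above):

SELECT count() FROM remote('example-perftest01j.yandex.ru', tutorial.hits_local)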

Let's run an INSERT SELECT into the distributed table to spread the table over multiple servers:

    INSERT INTO tutorial.hits_all SELECT * FROM tutorial.hits_v1;

As you would expect, computationally heavy queries run N times faster when they use 3 servers instead of one.

In this case, we used a cluster with 3 shards, each containing a single replica.

To provide resilience in a production environment, we recommend that each shard contain 2-3 replicas spread across multiple availability zones or datacenters (or at least racks). Note that ClickHouse supports an unlimited number of replicas.

Example config for a cluster of one shard containing three replicas:

<remote_servers>
    ...
    <perftest_1shards_3replicas>
        <shard>
            <replica>
                <host>example-perftest01j.yandex.ru</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example-perftest02j.yandex.ru</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example-perftest03j.yandex.ru</host>
                <port>9000</port>
            </replica>
        </shard>
    </perftest_1shards_3replicas>
</remote_servers>

ZooKeeper is required to enable native replication. ClickHouse takes care of data consistency on all replicas and runs a restore procedure after failure automatically. It's recommended to deploy the ZooKeeper cluster on separate servers (where no other processes, including ClickHouse, are running).

ZooKeeper locations are specified in the configuration file:

<zookeeper>
    <node>
        <host>zoo01.yandex.ru</host>
        <port>2181</port>
    </node>
    <node>
        <host>zoo02.yandex.ru</host>
        <port>2181</port>
    </node>
    <node>
        <host>zoo03.yandex.ru</host>
        <port>2181</port>
    </node>
</zookeeper>

Also, we need to set macros that identify each shard and replica, which are used on table creation:

<macros>
    <shard>01</shard>
    <replica>01</replica>
</macros>

If there are no replicas at the moment a replicated table is created, a new first replica is instantiated. If there are already live replicas, the new replica clones the data from the existing ones. You can either create all the replicated tables first and then insert data into them, or create some replicas and add the others after or during data insertion.

CREATE TABLE tutorial.hits_replica (...)
ENGINE = ReplicatedMergeTree(
    '/clickhouse_perftest/tables/{shard}/hits',
    '{replica}'
)
...

Here we use the ReplicatedMergeTree table engine. In the parameters, we specify the ZooKeeper path containing the shard and replica identifiers.

Note: this approach is not suitable for sharding large tables. There is a separate tool, clickhouse-copier, that can re-shard arbitrarily large tables.

Note that ZooKeeper is not a strict requirement: in some simple cases, you can duplicate the data by writing it into all the replicas from your application code. This approach is not recommended: in this case, ClickHouse won't be able to guarantee data consistency on all replicas, so this becomes the responsibility of your application.


INSERT INTO tutorial.hits_replica SELECT * FROM tutorial.hits_local;

Replication operates in multi-master mode. Data can be loaded into any replica, and the system then syncs it with the other instances automatically. Replication is asynchronous, so at a given moment not all replicas may contain recently inserted data. At least one replica has to be up to allow data ingestion; the others will sync up the data and repair consistency once they become active again. Note that this approach allows for a low possibility of losing recently inserted data.

Original article

Example datasets: This section describes how to obtain example datasets and import them into ClickHouse. For some datasets, example queries are also available.

Anonymized Yandex.Metrica Dataset
Star Schema Benchmark
WikiStat
Terabyte of Click Logs from Criteo
AMPLab Big Data Benchmark
New York Taxi Data
OnTime

Original article

Anonymized Yandex.Metrica Data: The dataset consists of two tables containing anonymized data about hits (hits_v1) and visits (visits_v1) of Yandex.Metrica. You can read more about Yandex.Metrica in the ClickHouse history section.

The dataset consists of two tables; either of them can be downloaded as a compressed tsv.xz file or as prepared partitions. In addition, an extended version of the hits table containing 100 million rows is available as TSV at https://datasets.clickhouse.tech/hits/tsv/hits_100m_obfuscated_v1.tsv.xz and as prepared partitions at https://datasets.clickhouse.tech/hits/partitions/hits_100m_obfuscated_v1.tar.xz.

Obtaining tables from prepared partitions. Download and import the hits table:

curl -O https://datasets.clickhouse.tech/hits/partitions/hits_v1.tar
tar xvf hits_v1.tar -C /var/lib/clickhouse # path to ClickHouse data directory
# check permissions on unpacked data, fix if required
sudo service clickhouse-server restart
clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"

Download and import the visits table:

curl -O https://datasets.clickhouse.tech/visits/partitions/visits_v1.tar
tar xvf visits_v1.tar -C /var/lib/clickhouse # path to ClickHouse data directory
# check permissions on unpacked data, fix if required
sudo service clickhouse-server restart
clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"

Obtaining tables from a compressed TSV file. Download and import hits from the compressed TSV file:

curl https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
# now create table
clickhouse-client --query "CREATE DATABASE IF NOT EXISTS datasets"
clickhouse-client --query "CREATE TABLE datasets.hits_v1 ( WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, URLDomain String, RefererDomain String, Refresh UInt8, IsRobot UInt8, RefererCategories Array(UInt16), URLCategories Array(UInt16), URLRegions Array(UInt32), RefererRegions Array(UInt32), ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), UTCEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, GeneralInterests Array(UInt16), RemoteIP UInt32, RemoteIP6 FixedString(16), WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming Int32, DNSTiming Int32, ConnectTiming Int32, ResponseStartTiming Int32, ResponseEndTiming Int32, FetchTiming Int32, RedirectTiming Int32, DOMInteractiveTiming Int32, DOMContentLoadedTiming Int32, DOMCompleteTiming Int32, LoadEventStartTiming Int32, LoadEventEndTiming Int32, NSToDOMContentLoadedTiming Int32, FirstPaintTiming Int32, RedirectCount Int8, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String, ParamCurrency FixedString(3), ParamCurrencyID UInt16, GoalsReached Array(UInt32), OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64, URLHash UInt64, CLID UInt32, YCLID UInt64, ShareService String, ShareURL String, ShareTitle String, ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), IslandID FixedString(16), RequestNum UInt32, RequestTry UInt8) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192"
# import data
cat hits_v1.tsv | clickhouse-client --query "INSERT INTO datasets.hits_v1 FORMAT TSV" --max_insert_block_size=100000
# optionally you can optimize table
clickhouse-client --query "OPTIMIZE TABLE datasets.hits_v1 FINAL"
clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"

Download and import visits from the compressed TSV file:


curl https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv
# now create table
clickhouse-client --query "CREATE DATABASE IF NOT EXISTS datasets"
clickhouse-client --query "CREATE TABLE datasets.visits_v1 ( CounterID UInt32, StartDate Date, Sign Int8, IsNew UInt8, VisitID UInt64, UserID UInt64, StartTime DateTime, Duration UInt32, UTCStartTime DateTime, PageViews Int32, Hits Int32, IsBounce UInt8, Referer String, StartURL String, RefererDomain String, StartURLDomain String, EndURL String, LinkURL String, IsDownload UInt8, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, PlaceID Int32, RefererCategories Array(UInt16), URLCategories Array(UInt16), URLRegions Array(UInt32), RefererRegions Array(UInt32), IsYandex UInt8, GoalReachesDepth Int32, GoalReachesURL Int32, GoalReachesAny Int32, SocialSourceNetworkID UInt8, SocialSourcePage String, MobilePhoneModel String, ClientEventTime DateTime, RegionID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RemoteIP UInt32, RemoteIP6 FixedString(16), IPNetworkID UInt32, SilverlightVersion3 UInt32, CodeVersion UInt32, ResolutionWidth UInt16, ResolutionHeight UInt16, UserAgentMajor UInt16, UserAgentMinor UInt16, WindowClientWidth UInt16, WindowClientHeight UInt16, SilverlightVersion2 UInt8, SilverlightVersion4 UInt16, FlashVersion3 UInt16, FlashVersion4 UInt16, ClientTimeZone Int16, OS UInt8, UserAgent UInt8, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, NetMajor UInt8, NetMinor UInt8, MobilePhone UInt8, SilverlightVersion1 UInt8, Age UInt8, Sex UInt8, Income UInt8, JavaEnable UInt8, CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, BrowserLanguage UInt16, BrowserCountry UInt16, Interests UInt16, Robotness UInt8, GeneralInterests Array(UInt16), Params Array(String), Goals Nested(ID UInt32, Serial UInt32, EventTime DateTime, Price Int64, OrderID String, CurrencyID UInt32), WatchIDs Array(UInt64), ParamSumPrice Int64, ParamCurrency FixedString(3), ParamCurrencyID UInt16, ClickLogID UInt64, ClickEventID Int32, ClickGoodEvent Int32, ClickEventTime DateTime, ClickPriorityID Int32, ClickPhraseID Int32, ClickPageID Int32, ClickPlaceID Int32, ClickTypeID Int32, ClickResourceID Int32, ClickCost UInt32, ClickClientIP UInt32, ClickDomainID UInt32, ClickURL String, ClickAttempt UInt8, ClickOrderID UInt32, ClickBannerID UInt32, ClickMarketCategoryID UInt32, ClickMarketPP UInt32, ClickMarketCategoryName String, ClickMarketPPName String, ClickAWAPSCampaignName String, ClickPageName String, ClickTargetType UInt16, ClickTargetPhraseID UInt64, ClickContextType UInt8, ClickSelectType Int8, ClickOptions String, ClickGroupBannerID Int32, OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, FirstVisit DateTime, PredLastVisit Date, LastVisit Date, TotalVisits UInt32, TraficSource Nested(ID Int8, SearchEngineID UInt16, AdvEngineID UInt8, PlaceID UInt16, SocialSourceNetworkID UInt8, Domain String, SearchPhrase String, SocialSourcePage String), Attendance FixedString(16), CLID UInt32, YCLID UInt64, NormalizedRefererHash UInt64, SearchPhraseHash UInt64, RefererDomainHash UInt64, NormalizedStartURLHash UInt64, StartURLDomainHash UInt64, NormalizedEndURLHash UInt64, TopLevelDomain UInt64, URLScheme UInt64, OpenstatServiceNameHash UInt64, OpenstatCampaignIDHash UInt64, OpenstatAdIDHash UInt64, OpenstatSourceIDHash UInt64, UTMSourceHash UInt64, UTMMediumHash UInt64, UTMCampaignHash UInt64, UTMContentHash UInt64, UTMTermHash UInt64, FromHash UInt64, WebVisorEnabled UInt8, WebVisorActivity UInt32, ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), Market Nested(Type UInt8, GoalID UInt32, OrderID String, OrderPrice Int64, PP UInt32, DirectPlaceID UInt32, DirectOrderID UInt32, DirectBannerID UInt32, GoodID String, GoodName String, GoodQuantity Int32, GoodPrice Int64), IslandID FixedString(16)) ENGINE = CollapsingMergeTree(Sign) PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192"
# import data
cat visits_v1.tsv | clickhouse-client --query "INSERT INTO datasets.visits_v1 FORMAT TSV" --max_insert_block_size=100000
# optionally you can optimize table
clickhouse-client --query "OPTIMIZE TABLE datasets.visits_v1 FINAL"
clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"

Example queries: The ClickHouse tutorial is based on this Yandex.Metrica dataset, and going through the tutorial is the recommended way to get started with it.

Additional examples of queries against these tables can be found among the ClickHouse stateful tests (https://github.com/ClickHouse/ClickHouse/tree/master/tests/queries/1_stateful), where the tables are named test.hits and test.visits.

Star Schema Benchmark. Compile dbgen:

$ git clone git@github.com:vadimtk/ssb-dbgen.git
$ cd ssb-dbgen
$ make

Start generating data:

$ ./dbgen -s 1000 -T c
$ ./dbgen -s 1000 -T l
$ ./dbgen -s 1000 -T p
$ ./dbgen -s 1000 -T s
$ ./dbgen -s 1000 -T d

Create the tables in ClickHouse:

Note: with -s 100, dbgen generates 600 million rows (67 GB); with -s 1000 it generates 6 billion rows (which takes a lot of time).


CREATE TABLE customer( C_CUSTKEY UInt32, C_NAME String, C_ADDRESS String, C_CITY LowCardinality(String), C_NATION LowCardinality(String), C_REGION LowCardinality(String), C_PHONE String, C_MKTSEGMENT LowCardinality(String))ENGINE = MergeTree ORDER BY (C_CUSTKEY);

    CREATE TABLE lineorder( LO_ORDERKEY UInt32, LO_LINENUMBER UInt8, LO_CUSTKEY UInt32, LO_PARTKEY UInt32, LO_SUPPKEY UInt32, LO_ORDERDATE Date, LO_ORDERPRIORITY LowCardinality(String), LO_SHIPPRIORITY UInt8, LO_QUANTITY UInt8, LO_EXTENDEDPRICE UInt32, LO_ORDTOTALPRICE UInt32, LO_DISCOUNT UInt8, LO_REVENUE UInt32, LO_SUPPLYCOST UInt32, LO_TAX UInt8, LO_COMMITDATE Date, LO_SHIPMODE LowCardinality(String))ENGINE = MergeTree PARTITION BY toYear(LO_ORDERDATE) ORDER BY (LO_ORDERDATE, LO_ORDERKEY);

    CREATE TABLE part( P_PARTKEY UInt32, P_NAME String, P_MFGR LowCardinality(String), P_CATEGORY LowCardinality(String), P_BRAND LowCardinality(String), P_COLOR LowCardinality(String), P_TYPE LowCardinality(String), P_SIZE UInt8, P_CONTAINER LowCardinality(String))ENGINE = MergeTree ORDER BY P_PARTKEY;

    CREATE TABLE supplier( S_SUPPKEY UInt32, S_NAME String, S_ADDRESS String, S_CITY LowCardinality(String), S_NATION LowCardinality(String), S_REGION LowCardinality(String), S_PHONE String)ENGINE = MergeTree ORDER BY S_SUPPKEY;

Insert the data:

    $ clickhouse-client --query "INSERT INTO customer FORMAT CSV" < customer.tbl$ clickhouse-client --query "INSERT INTO part FORMAT CSV" < part.tbl$ clickhouse-client --query "INSERT INTO supplier FORMAT CSV" < supplier.tbl$ clickhouse-client --query "INSERT INTO lineorder FORMAT CSV" < lineorder.tbl

Convert the star schema into a denormalized flat schema:

    SET max_memory_usage = 20000000000, allow_experimental_multiple_joins_emulation = 1;

CREATE TABLE lineorder_flat
ENGINE = MergeTree
PARTITION BY toYear(LO_ORDERDATE)
ORDER BY (LO_ORDERDATE, LO_ORDERKEY) AS
SELECT l.*, c.*, s.*, p.*
FROM lineorder l
    ANY INNER JOIN customer c ON (c.C_CUSTKEY = l.LO_CUSTKEY)
    ANY INNER JOIN supplier s ON (s.S_SUPPKEY = l.LO_SUPPKEY)
    ANY INNER JOIN part p ON (p.P_PARTKEY = l.LO_PARTKEY);

    ALTER TABLE lineorder_flat DROP COLUMN C_CUSTKEY, DROP COLUMN S_SUPPKEY, DROP COLUMN P_PARTKEY;

Run the queries:

    Q1.1

    SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue FROM lineorder_flat WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;

    Q1.2

    SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue FROM lineorder_flat WHERE toYYYYMM(LO_ORDERDATE) = 199401 AND LO_DISCOUNT BETWEEN 4 AND 6 AND LO_QUANTITY BETWEEN 26 AND 35;

    Q1.3

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue FROM lineorder_flat WHERE toISOWeek(LO_ORDERDATE) = 6 AND toYear(LO_ORDERDATE) = 1994 AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;

    Q2.1

    SELECT sum(LO_REVENUE), toYear(LO_ORDERDATE) AS year, P_BRAND FROM lineorder_flat WHERE P_CATEGORY = 'MFGR#12' AND S_REGION = 'AMERICA' GROUP BY year, P_BRAND ORDER BY year, P_BRAND;

    Q2.2

    SELECT sum(LO_REVENUE), toYear(LO_ORDERDATE) AS year, P_BRAND FROM lineorder_flat WHERE P_BRAND BETWEEN 'MFGR#2221' AND 'MFGR#2228' AND S_REGION = 'ASIA' GROUP BY year, P_BRAND ORDER BY year, P_BRAND;

    Q2.3

    SELECT sum(LO_REVENUE), toYear(LO_ORDERDATE) AS year, P_BRAND FROM lineorder_flat WHERE P_BRAND = 'MFGR#2239' AND S_REGION = 'EUROPE' GROUP BY year, P_BRAND ORDER BY year, P_BRAND;

    Q3.1

SELECT C_NATION, S_NATION, toYear(LO_ORDERDATE) AS year, sum(LO_REVENUE) AS revenue FROM lineorder_flat WHERE C_REGION = 'ASIA' AND S_REGION = 'ASIA' AND year >= 1992 AND year <= 1997 GROUP BY C_NATION, S_NATION, year ORDER BY year ASC, revenue DESC;

WikiStat. Download and load the data:

$ for i in {2007..2016}; do for j in {01..12}; do echo $i-$j >&2; curl -sSL "http://dumps.wikimedia.org/other/pagecounts-raw/$i/$i-$j/" | grep -oE 'pagecounts-[0-9]+-[0-9]+\.gz'; done; done | sort | uniq | tee links.txt
$ cat links.txt | while read link; do wget http://dumps.wikimedia.org/other/pagecounts-raw/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1/')/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1-\2/')/$link; done
$ ls -1 /opt/wikistat/ | grep gz | while read i; do echo $i; gzip -cd /opt/wikistat/$i | ./wikistat-loader --time="$(echo -n $i | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})([0-9]{2})-([0-9]{2})([0-9]{2})([0-9]{2})\.gz/\1-\2-\3 \4-00-00/')" | clickhouse-client --query="INSERT INTO wikistat FORMAT TabSeparated"; done

Original article

Terabyte of Click Logs from Criteo. The data can be downloaded from http://labs.criteo.com/downloads/download-terabyte-click-logs/.

Create a table for the raw data:

    CREATE TABLE criteo_log (date Date, clicked UInt8, int1 Int32, int2 Int32, int3 Int32, int4 Int32, int5 Int32, int6 Int32, int7 Int32, int8 Int32, int9 Int32, int10 Int32, int11 Int32, int12 Int32, int13 Int32, cat1 String, cat2 String, cat3 String, cat4 String, cat5 String, cat6 String, cat7 String, cat8 String, cat9 String, cat10 String, cat11 String, cat12 String, cat13 String, cat14 String, cat15 String, cat16 String, cat17 String, cat18 String, cat19 String, cat20 String, cat21 String, cat22 String, cat23 String, cat24 String, cat25 String, cat26 String) ENGINE = Log

Download the data:

    $ for i in {00..23}; do echo $i; zcat datasets/criteo/day_${i#0}.gz | sed -r 's/^/2000-01-'${i/00/24}'\t/' | clickhouse-client --host=example-perftest01j --query="INSERT INTO criteo_log FORMAT TabSeparated"; done

Create a table for the converted data:

    CREATE TABLE criteo( date Date, clicked UInt8, int1 Int32, int2 Int32, int3 Int32, int4 Int32, int5 Int32, int6 Int32, int7 Int32, int8 Int32, int9 Int32, int10 Int32, int11 Int32, int12 Int32, int13 Int32, icat1 UInt32, icat2 UInt32, icat3 UInt32, icat4 UInt32, icat5 UInt32, icat6 UInt32, icat7 UInt32, icat8 UInt32, icat9 UInt32, icat10 UInt32, icat11 UInt32, icat12 UInt32, icat13 UInt32, icat14 UInt32, icat15 UInt32, icat16 UInt32, icat17 UInt32, icat18 UInt32, icat19 UInt32, icat20 UInt32, icat21 UInt32, icat22 UInt32, icat23 UInt32, icat24 UInt32, icat25 UInt32, icat26 UInt32) ENGINE = MergeTree(date, intHash32(icat1), (date, intHash32(icat1)), 8192)

Transform the raw data from the first table and write it into the second table:

    INSERT INTO criteo SELECT date, clicked, int1, int2, int3, int4, int5, int6, int7, int8, int9, int10, int11, int12, int13, reinterpretAsUInt32(unhex(cat1)) AS icat1, reinterpretAsUInt32(unhex(cat2)) AS icat2, reinterpretAsUInt32(unhex(cat3)) AS icat3, reinterpretAsUInt32(unhex(cat4)) AS icat4, reinterpretAsUInt32(unhex(cat5)) AS icat5, reinterpretAsUInt32(unhex(cat6)) AS icat6, reinterpretAsUInt32(unhex(cat7)) AS icat7, reinterpretAsUInt32(unhex(cat8)) AS icat8, reinterpretAsUInt32(unhex(cat9)) AS icat9, reinterpretAsUInt32(unhex(cat10)) AS icat10, reinterpretAsUInt32(unhex(cat11)) AS icat11, reinterpretAsUInt32(unhex(cat12)) AS icat12, reinterpretAsUInt32(unhex(cat13)) AS icat13, reinterpretAsUInt32(unhex(cat14)) AS icat14, reinterpretAsUInt32(unhex(cat15)) AS icat15, reinterpretAsUInt32(unhex(cat16)) AS icat16, reinterpretAsUInt32(unhex(cat17)) AS icat17, reinterpretAsUInt32(unhex(cat18)) AS icat18, reinterpretAsUInt32(unhex(cat19)) AS icat19, reinterpretAsUInt32(unhex(cat20)) AS icat20, reinterpretAsUInt32(unhex(cat21)) AS icat21, reinterpretAsUInt32(unhex(cat22)) AS icat22, reinterpretAsUInt32(unhex(cat23)) AS icat23, reinterpretAsUInt32(unhex(cat24)) AS icat24, reinterpretAsUInt32(unhex(cat25)) AS icat25, reinterpretAsUInt32(unhex(cat26)) AS icat26 FROM criteo_log;

    DROP TABLE criteo_log;

Original article

AMPLab Big Data Benchmark. See https://amplab.cs.berkeley.edu/benchmark/.

You need a free account at Amazon; registration requires a credit card, email address, and phone number. Afterwards, get a new access key in the Amazon AWS Console.


Run the following in the console:

$ sudo apt-get install s3cmd
$ mkdir tiny; cd tiny;
$ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/tiny/ .
$ cd ..
$ mkdir 1node; cd 1node;
$ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/1node/ .
$ cd ..
$ mkdir 5nodes; cd 5nodes;
$ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/5nodes/ .
$ cd ..

Run the following queries in ClickHouse:

    CREATE TABLE rankings_tiny( pageURL String, pageRank UInt32, avgDuration UInt32) ENGINE = Log;

    CREATE TABLE uservisits_tiny( sourceIP String, destinationURL String, visitDate Date, adRevenue Float32, UserAgent String, cCode FixedString(3), lCode FixedString(6), searchWord String, duration UInt32) ENGINE = MergeTree(visitDate, visitDate, 8192);

    CREATE TABLE rankings_1node( pageURL String, pageRank UInt32, avgDuration UInt32) ENGINE = Log;

    CREATE TABLE uservisits_1node( sourceIP String, destinationURL String, visitDate Date, adRevenue Float32, UserAgent String, cCode FixedString(3), lCode FixedString(6), searchWord String, duration UInt32) ENGINE = MergeTree(visitDate, visitDate, 8192);

    CREATE TABLE rankings_5nodes_on_single( pageURL String, pageRank UInt32, avgDuration UInt32) ENGINE = Log;

    CREATE TABLE uservisits_5nodes_on_single( sourceIP String, destinationURL String, visitDate Date, adRevenue Float32, UserAgent String, cCode FixedString(3), lCode FixedString(6), searchWord String, duration UInt32) ENGINE = MergeTree(visitDate, visitDate, 8192);

Go back to the console and run:

$ for i in tiny/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_tiny FORMAT CSV"; done
$ for i in tiny/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_tiny FORMAT CSV"; done
$ for i in 1node/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_1node FORMAT CSV"; done
$ for i in 1node/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_1node FORMAT CSV"; done
$ for i in 5nodes/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_5nodes_on_single FORMAT CSV"; done
$ for i in 5nodes/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_5nodes_on_single FORMAT CSV"; done

Simple query examples:

SELECT pageURL, pageRank FROM rankings_1node WHERE pageRank > 1000

    SELECT substring(sourceIP, 1, 8), sum(adRevenue) FROM uservisits_1node GROUP BY substring(sourceIP, 1, 8)

SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank) AS pageRank
FROM rankings_1node ALL INNER JOIN
(
    SELECT sourceIP, destinationURL AS pageURL, adRevenue
    FROM uservisits_1node
    WHERE (visitDate > '1980-01-01') AND (visitDate < '1980-04-01')
) USING pageURL
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1

Original article

New York Taxi Data. There are two ways to get the New York City taxi data:

Import from raw data

Download prepared partitions

How to import the raw data: See https://github.com/toddwschneider/nyc-taxi-data and http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html for a description of the dataset structure and the download instructions.

The dataset contains 227 GB of CSV files. The download takes about an hour over a 1 Gbit connection (parallel downloading from s3.amazonaws.com can cut the time at least in half). Watch out for corrupted files during the download; you can check the file sizes and re-download any that look damaged.

Some of the files contain invalid rows; you can fix them as follows:

sed -E '/(.*,){18,}/d' data/yellow_tripdata_2010-02.csv > data/yellow_tripdata_2010-02.csv_
sed -E '/(.*,){18,}/d' data/yellow_tripdata_2010-03.csv > data/yellow_tripdata_2010-03.csv_
mv data/yellow_tripdata_2010-02.csv_ data/yellow_tripdata_2010-02.csv
mv data/yellow_tripdata_2010-03.csv_ data/yellow_tripdata_2010-03.csv

Then the data must be pre-processed in PostgreSQL. This creates selections of points in the polygons (to match points on the map with the boroughs of New York City) and combines all the data into a single denormalized flat table using a join. To do this, you will need to install PostgreSQL with PostGIS support.

Be careful when running initialize_database.sh, and manually re-check that all the tables were created correctly.

It takes about 20-30 minutes to process each month of data in PostgreSQL, for a total of about 48 hours.

You can check the number of downloaded rows as follows:

$ time psql nyc-taxi-data -c "SELECT count(*) FROM trips;"
### Count
 1298979494
(1 row)

    real 7m9.164s

(This is slightly more than 1.1 billion rows, as reported by Mark Litwintschik in his series of blog posts.)

The data in PostgreSQL uses about 370 GB of disk space.

Export the data from PostgreSQL:


COPY
(
    SELECT trips.id, trips.vendor_id, trips.pickup_datetime, trips.dropoff_datetime, trips.store_and_fwd_flag, trips.rate_code_id, trips.pickup_longitude, trips.pickup_latitude, trips.dropoff_longitude, trips.dropoff_latitude, trips.passenger_count, trips.trip_distance, trips.fare_amount, trips.extra, trips.mta_tax, trips.tip_amount, trips.tolls_amount, trips.ehail_fee, trips.improvement_surcharge, trips.total_amount, trips.payment_type, trips.trip_type, trips.pickup, trips.dropoff,
    cab_types.type cab_type,
    weather.precipitation_tenths_of_mm rain, weather.snow_depth_mm, weather.snowfall_mm, weather.max_temperature_tenths_degrees_celsius max_temp, weather.min_temperature_tenths_degrees_celsius min_temp, weather.average_wind_speed_tenths_of_meters_per_second wind,
    pick_up.gid pickup_nyct2010_gid, pick_up.ctlabel pickup_ctlabel, pick_up.borocode pickup_borocode, pick_up.boroname pickup_boroname, pick_up.ct2010 pickup_ct2010, pick_up.boroct2010 pickup_boroct2010, pick_up.cdeligibil pickup_cdeligibil, pick_up.ntacode pickup_ntacode, pick_up.ntaname pickup_ntaname, pick_up.puma pickup_puma,
    drop_off.gid dropoff_nyct2010_gid, drop_off.ctlabel dropoff_ctlabel, drop_off.borocode dropoff_borocode, drop_off.boroname dropoff_boroname, drop_off.ct2010 dropoff_ct2010, drop_off.boroct2010 dropoff_boroct2010, drop_off.cdeligibil dropoff_cdeligibil, drop_off.ntacode dropoff_ntacode, drop_off.ntaname dropoff_ntaname, drop_off.puma dropoff_puma
    FROM trips
    LEFT JOIN cab_types ON trips.cab_type_id = cab_types.id
    LEFT JOIN central_park_weather_observations_raw weather ON weather.date = trips.pickup_datetime::date
    LEFT JOIN nyct2010 pick_up ON pick_up.gid = trips.pickup_nyct2010_gid
    LEFT JOIN nyct2010 drop_off ON drop_off.gid = trips.dropoff_nyct2010_gid
) TO '/opt/milovidov/nyc-taxi-data/trips.tsv';

The data snapshot is created at a speed of about 50 MB per second. While creating the snapshot, PostgreSQL reads from the disk at about 28 MB per second. This takes about 5 hours. The resulting TSV file is 590612904969 bytes.

Create a temporary table in ClickHouse:

CREATE TABLE trips
(
    trip_id UInt32, vendor_id String, pickup_datetime DateTime, dropoff_datetime Nullable(DateTime),
    store_and_fwd_flag Nullable(FixedString(1)), rate_code_id Nullable(UInt8),
    pickup_longitude Nullable(Float64), pickup_latitude Nullable(Float64),
    dropoff_longitude Nullable(Float64), dropoff_latitude Nullable(Float64),
    passenger_count Nullable(UInt8), trip_distance Nullable(Float64),
    fare_amount Nullable(Float32), extra Nullable(Float32), mta_tax Nullable(Float32),
    tip_amount Nullable(Float32), tolls_amount Nullable(Float32), ehail_fee Nullable(Float32),
    improvement_surcharge Nullable(Float32), total_amount Nullable(Float32),
    payment_type Nullable(String), trip_type Nullable(UInt8), pickup Nullable(String), dropoff Nullable(String),
    cab_type Nullable(String), precipitation Nullable(UInt8), snow_depth Nullable(UInt8), snowfall Nullable(UInt8),
    max_temperature Nullable(UInt8), min_temperature Nullable(UInt8), average_wind_speed Nullable(UInt8),
    pickup_nyct2010_gid Nullable(UInt8), pickup_ctlabel Nullable(String), pickup_borocode Nullable(UInt8),
    pickup_boroname Nullable(String), pickup_ct2010 Nullable(String), pickup_boroct2010 Nullable(String),
    pickup_cdeligibil Nullable(FixedString(1)), pickup_ntacode Nullable(String), pickup_ntaname Nullable(String),
    pickup_puma Nullable(String),
    dropoff_nyct2010_gid Nullable(UInt8), dropoff_ctlabel Nullable(String), dropoff_borocode Nullable(UInt8),
    dropoff_boroname Nullable(String), dropoff_ct2010 Nullable(String), dropoff_boroct2010 Nullable(String),
    dropoff_cdeligibil Nullable(String), dropoff_ntacode Nullable(String), dropoff_ntaname Nullable(String),
    dropoff_puma Nullable(String)
) ENGINE = Log;

Next, the fields need to be converted to more correct data types and, where possible, NULLs eliminated.

    $ time clickhouse-client --query="INSERT INTO trips FORMAT TabSeparated" < trips.tsv

    real 75m56.214s

Data is read at a speed of 112-140 MB/second. Loading the data into a Log-type table this way took 76 minutes. The data in this table uses 142 GB of disk space.

(You can also import data directly from Postgres using COPY ... TO PROGRAM.)

All the weather-related fields in the data (precipitation ... average_wind_speed) were filled with NULL, so we will remove them from the final dataset.
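As a hedged sanity check before dropping them (in ClickHouse, count(column) counts only non-NULL values), one could verify that a weather column is entirely NULL:

SELECT count(precipitation) FROM trips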

To start, we'll create a table on a single server. Later on, we'll distribute the table across multiple nodes.

Create the table structure and fill it with data:

CREATE TABLE trips_mergetree
ENGINE = MergeTree(pickup_date, pickup_datetime, 8192)
AS SELECT

    trip_id,CAST(vendor_id AS Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14)) AS vendor_id,toDate(pickup_datetime) AS pickup_date,ifNull(pickup_datetime, toDateTime(0)) AS pickup_datetime,toDate(dropoff_datetime) AS dropoff_date,ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,assumeNotNull(store_and_fwd_flag) IN ('Y', '1', '2') AS store_and_fwd_flag,assumeNotNull(rate_code_id) AS rate_code_id,assumeNotNull(pickup_longitude) AS pickup_longitude,assumeNotNull(pickup_latitude) AS pickup_latitude,assumeNotNull(dropoff_longitude) AS dropoff_longitude,assumeNotNull(dropoff_latitude) AS dropoff_latitude,assumeNotNull(passenger_count) AS passenger_count,assumeNotNull(trip_distance) AS trip_distance,assumeNotNull(fare_amount) AS fare_amount,assumeNotNull(extra) AS extra,assumeNotNull(mta_tax) AS mta_tax,assumeNotNull(tip_amount) AS tip_amount,assumeNotNull(tolls_amount) AS tolls_amount,assumeNotNull(ehail_fee) AS ehail_fee,assumeNotNull(improvement_surcharge) AS improvement_surcharge,assumeNotNull(total_amount) AS total_amount,CAST((assumeNotNull(payment_type) AS pt) IN ('CSH', 'CASH', 'Cash', 'CAS', 'Cas', '1') ? 'CSH' : (pt IN ('CRD', 'Credit', 'Cre', 'CRE', 'CREDIT', '2') ? 'CRE' : (pt IN ('NOC', 'No Charge', 'No', '3') ? 'NOC' : (pt IN ('DIS', 'Dispute', 'Dis', '4') ? 'DIS' : 'UNK'))) AS Enum8('CSH' = 1, 'CRE' = 2, 'UNK' = 0, 'NOC' = 3, 'DIS' = 4)) AS payment_type_,assumeNotNull(trip_type) AS trip_type,

ifNull(toFixedString(unhex(pickup), 25), toFixedString('', 25)) AS pickup,ifNull(toFixedString(unhex(dropoff), 25), toFixedString('', 25)) AS dropoff,CAST(assumeNotNull(cab_type) AS Enum8('yellow' = 1, 'green' = 2, 'uber' = 3)) AS cab_type,

    assumeNotNull(pickup_nyct2010_gid) AS pickup_nyct2010_gid,toFloat32(ifNull(pickup_ctlabel, '0')) AS pickup_ctlabel,assumeNotNull(pickup_borocode) AS pickup_borocode,CAST(assumeNotNull(pickup_boroname) AS Enum8('Manhattan' = 1, 'Queens' = 4, 'Brooklyn' = 3, '' = 0, 'Bronx' = 2, 'Staten Island' = 5)) AS pickup_boroname,toFixedString(ifNull(pickup_ct2010, '000000'), 6) AS pickup_ct2010,toFixedString(ifNull(pickup_boroct2010, '0000000'), 7) AS pickup_boroct2010,CAST(assumeNotNull(ifNull(pickup_cdeligibil, ' ')) AS Enum8(' ' = 0, 'E' = 1, 'I' = 2)) AS pickup_cdeligibil,toFixedString(ifNull(pickup_ntacode, '0000'), 4) AS pickup_ntacode,

    CAST(assumeNotNull(pickup_ntaname) AS Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem North-Polo Grounds' = 30, 'Central Harlem South' = 31, 'Charleston-Richmond Valley-Tottenville' = 32, 'Chinatown' = 33, 'Claremont-Bathgate' = 34, 'Clinton' = 35, 'Clinton Hill' = 36, 'Co-op City' = 37, 'College Point' = 38, 'Corona' = 39, 'Crotona Park East' = 40, 'Crown Heights North' = 41, 'Crown Heights South' = 42, 'Cypress Hills-City Line' = 43, 'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill' = 44, 'Douglas Manor-Douglaston-Little Neck' = 45, 'Dyker Heights' = 46, 'East Concourse-Concourse Village' = 47, 'East Elmhurst' = 48, 'East Flatbush-Farragut' = 49, 'East Flushing' = 50, 'East Harlem North' = 51, 'East Harlem South' = 52, 'East New York' = 53, 'East New York (Pennsylvania Ave)' = 54, 'East Tremont' = 55, 'East Village' = 56, 'East Williamsburg' = 57, 'Eastchester-Edenwald-Baychester' = 58, 'Elmhurst' = 59, 'Elmhurst-Maspeth' = 60, 'Erasmus' = 61, 'Far Rockaway-Bayswater' = 62, 'Flatbush' = 63, 'Flatlands' = 64, 'Flushing' = 65, 'Fordham South' = 66, 'Forest Hills' = 67, 'Fort Greene' = 68, 'Fresh Meadows-Utopia' = 69, 'Ft. Totten-Bay Terrace-Clearview' = 70, 'Georgetown-Marine Park-Bergen Beach-Mill Basin' = 71, 'Glen Oaks-Floral Park-New Hyde Park' = 72, 'Glendale' = 73, 'Gramercy' = 74, 'Grasmere-Arrochar-Ft. Wadsworth' = 75, 'Gravesend' = 76, 'Great Kills' = 77, 'Greenpoint' = 78, 'Grymes Hill-Clifton-Fox Hills' = 79, 'Hamilton Heights' = 80, 'Hammels-Arverne-Edgemere' = 81, 'Highbridge' = 82, 'Hollis' = 83, 'Homecrest' = 84, 'Hudson Yards-Chelsea-Flatiron-Union Square' = 85, 'Hunters Point-Sunnyside-West Maspeth' = 86, 'Hunts Point' = 87, 'Jackson Heights' = 88, 'Jamaica' = 89, 'Jamaica Estates-Holliswood' = 90, 'Kensington-Ocean Parkway' = 91, 'Kew Gardens' = 92, 'Kew Gardens Hills' = 93, 'Kingsbridge Heights' = 94, 'Laurelton' = 95, 'Lenox Hill-Roosevelt Island' = 96, 'Lincoln Square' = 97, 'Lindenwood-Howard Beach' = 98, 'Longwood' = 99, 'Lower East Side' = 100, 'Madison' = 101, 'Manhattanville' = 102, 'Marble Hill-Inwood' = 103, 'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville' = 104, 'Maspeth' = 105, 'Melrose South-Mott Haven North' = 106, 'Middle Village' = 107, 'Midtown-Midtown South' = 108, 'Midwood' = 109, 'Morningside Heights' = 110, 'Morrisania-Melrose' = 111, 'Mott Haven-Port Morris' = 112, 'Mount Hope' = 113, 'Murray Hill' = 114, 'Murray Hill-Kips Bay' = 115, 'New Brighton-Silver Lake' = 116, 'New Dorp-Midland Beach' = 117, 'New Springville-Bloomfield-Travis' = 118, 'North Corona' = 119, 'North Riverdale-Fieldston-Riverdale' = 120, 'North Side-South Side' = 121, 'Norwood' = 122, 'Oakland Gardens' = 123, 'Oakwood-Oakwood Beach' = 124, 'Ocean Hill' = 125, 'Ocean Parkway South' = 126, 'Old Astoria' = 127, 'Old Town-Dongan Hills-South Beach' = 128, 'Ozone Park' = 129, 'Park Slope-Gowanus' = 130, 'Parkchester' = 131, 'Pelham Bay-Country Club-City Island' = 132, 'Pelham Parkway' = 133, 'Pomonok-Flushing Heights-Hillcrest' = 134, 'Port Richmond' = 135, 'Prospect Heights' = 136, 'Prospect Lefferts Gardens-Wingate' = 137, 'Queens Village' = 138, 'Queensboro Hill' = 139, 'Queensbridge-Ravenswood-Long Island City' = 140, 'Rego Park' = 141, 'Richmond Hill' = 142, 'Ridgewood' = 143, 'Rikers Island' = 144, 'Rosedale' = 145, 'Rossville-Woodrow' = 146, 'Rugby-Remsen Village' = 147, 'Schuylerville-Throgs Neck-Edgewater Park' = 148, 'Seagate-Coney Island' = 149, 'Sheepshead Bay-Gerritsen Beach-Manhattan Beach' = 150, 'SoHo-TriBeCa-Civic Center-Little Italy' = 151, 'Soundview-Bruckner' = 152, 'Soundview-Castle Hill-Clason Point-Harding Park' = 153, 'South Jamaica' = 154, 'South Ozone Park' = 155, 'Springfield Gardens North' = 156, 'Springfield Gardens South-Brookville' = 157, 'Spuyten Duyvil-Kingsbridge' = 158, 'St. Albans' = 159, 'Stapleton-Rosebank' = 160, 'Starrett City' = 161, 'Steinway' = 162, 'Stuyvesant Heights' = 163, 'Stuyvesant Town-Cooper Village' = 164, 'Sunset Park East' = 165, 'Sunset Park West' = 166, 'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill' = 167, 'Turtle Bay-East Midtown' = 168, 'University Heights-Morris Heights' = 169, 'Upper East Side-Carnegie Hill' = 170, 'Upper West Side' = 171, 'Van Cortlandt Village' = 172, 'Van Nest-Morris Park-Westchester Square' = 173, 'Washington Heights North' = 174, 'Washington Heights South' = 175, 'West Brighton' = 176, 'West Concourse' = 177, 'West Farms-Bronx River' = 178, 'West New Brighton-New Brighton-St. George' = 179, 'West Village' = 180, 'Westchester-Unionport' = 181, 'Westerleigh' = 182, 'Whitestone' = 183, 'Williamsbridge-Olinville' = 184, 'Williamsburg' = 185, 'Windsor Terrace' = 186, 'Woodhaven' = 187, 'Woodlawn-Wakefield' = 188, 'Woodside' = 189, 'Yorkville' = 190, 'park-cemetery-etc-Bronx' = 191, 'park-cemetery-etc-Brooklyn' = 192, 'park-cemetery-etc-Manhattan' = 193, 'park-cemetery-etc-Queens' = 194, 'park-cemetery-etc-Staten Island' = 195)) AS pickup_ntaname,

    toUInt16(ifNull(pickup_puma, '0')) AS pickup_puma,

    assumeNotNull(dropoff_nyct2010_gid) AS dropoff_nyct2010_gid,

    toFloat32(ifNull(dropoff_ctlabel, '0')) AS dropoff_ctlabel,

    assumeNotNull(dropoff_borocode) AS dropoff_borocode,

    CAST(assumeNotNull(dropoff_boroname) AS Enum8('Manhattan' = 1, 'Queens' = 4, 'Brooklyn' = 3, '' = 0, 'Bronx' = 2, 'Staten Island' = 5)) AS dropoff_boroname,

    toFixedString(ifNull(dropoff_ct2010, '000000'), 6) AS dropoff_ct2010,

    toFixedString(ifNull(dropoff_boroct2010, '0000000'), 7) AS dropoff_boroct2010,

    CAST(assumeNotNull(ifNull(dropoff_cdeligibil, ' ')) AS Enum8(' ' = 0, 'E' = 1, 'I' = 2)) AS dropoff_cdeligibil,

    toFixedString(ifNull(dropoff_ntacode, '0000'), 4) AS dropoff_ntacode,

    CAST(assumeNotNull(dropoff_ntaname) AS Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem North-Polo Grounds' = 30, 'Central Harlem South' = 31, 'Charleston-Richmond Valley-Tottenville' = 32, 'Chinatown' = 33, 'Claremont-Bathgate' = 34, 'Clinton' = 35, 'Clinton Hill' = 36, 'Co-op City' = 37, 'College Point' = 38, 'Corona' = 39, 'Crotona Park East' = 40, 'Crown Heights North' = 41, 'Crown Heights South' = 42, 'Cypress Hills-City Line' = 43, 'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill' = 44, 'Douglas Manor-Douglaston-Little Neck' = 45, 'Dyker Heights' = 46, 'East Concourse-Concourse Village' = 47, 'East Elmhurst' = 48, 'East Flatbush-Farragut' = 49, 'East Flushing' = 50, 'East Harlem North' = 51, 'East Harlem South' = 52, 'East New York' = 53, 'East New York (Pennsylvania Ave)' = 54, 'East Tremont' = 55, 'East Village' = 56, 'East Williamsburg' = 57, 'Eastchester-Edenwald-Baychester' = 58, 'Elmhurst' = 59, 'Elmhurst-Maspeth' = 60, 'Erasmus' = 61, 'Far Rockaway-Bayswater' = 62, 'Flatbush' = 63, 'Flatlands' = 64, 'Flushing' = 65, 'Fordham South' = 66, 'Forest Hills' = 67, 'Fort Greene' = 68, 'Fresh Meadows-Utopia' = 69, 'Ft. Totten-Bay Terrace-Clearview' = 70, 'Georgetown-Marine Park-Bergen Beach-Mill Basin' = 71, 'Glen Oaks-Floral Park-New Hyde Park' = 72, 'Glendale' = 73, 'Gramercy' = 74, 'Grasmere-Arrochar-Ft. Wadsworth' = 75, 'Gravesend' = 76, 'Great Kills' = 77, 'Greenpoint' = 78, 'Grymes Hill-Clifton-Fox Hills' = 79, 'Hamilton Heights' = 80, 'Hammels-Arverne-Edgemere' = 81, 'Highbridge' = 82, 'Hollis' = 83, 'Homecrest' = 84, 'Hudson Yards-Chelsea-Flatiron-Union Square' = 85, 'Hunters Point-Sunnyside-West Maspeth' = 86, 'Hunts Point' = 87, 'Jackson Heights' = 88, 'Jamaica' = 89, 'Jamaica Estates-Holliswood' = 90, 'Kensington-Ocean Parkway' = 91, 'Kew Gardens' = 92, 'Kew Gardens Hills' = 93, 'Kingsbridge Heights' = 94, 'Laurelton' = 95, 'Lenox Hill-Roosevelt Island' = 96, 'Lincoln Square' = 97, 'Lindenwood-Howard Beach' = 98, 'Longwood' = 99, 'Lower East Side' = 100, 'Madison' = 101, 'Manhattanville' = 102, 'Marble Hill-Inwood' = 103, 'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville' = 104, 'Maspeth' = 105, 'Melrose South-Mott Haven North' = 106, 'Middle Village' = 107, 'Midtown-Midtown South' = 108, 'Midwood' = 109, 'Morningside Heights' = 110, 'Morrisania-Melrose' = 111, 'Mott Haven-Port Morris' = 112, 'Mount Hope' = 113, 'Murray Hill' = 114, 'Murray Hill-Kips Bay' = 115, 'New Brighton-Silver Lake' = 116, 'New Dorp-Midland Beach' = 117, 'New Springville-Bloomfield-Travis' = 118, 'North Corona' = 119, 'North Riverdale-Fieldston-Riverdale' = 120, 'North Side-South Side' = 121, 'Norwood' = 122, 'Oakland Gardens' = 123, 'Oakwood-Oakwood Beach' = 124, 'Ocean Hill' = 125, 'Ocean Parkway South' = 126, 'Old Astoria' = 127, 'Old Town-Dongan Hills-South Beach' = 128, 'Ozone Park' = 129, 'Park Slope-Gowanus' = 130, 'Parkchester' = 131, 'Pelham Bay-Country Club-City Island' = 132, 'Pelham Parkway' = 133, 'Pomonok-Flushing Heights-Hillcrest' = 134, 'Port Richmond' = 135, 'Prospect Heights' = 136, 'Prospect Lefferts Gardens-Wingate' = 137, 'Queens Village' = 138, 'Queensboro Hill' = 139, 'Queensbridge-Ravenswood-Long Island City' = 140, 'Rego Park' = 141, 'Richmond Hill' = 142, 'Ridgewood' = 143, 'Rikers Island' = 144, 'Rosedale' = 145, 'Rossville-Woodrow' = 146, 'Rugby-Remsen Village' = 147, 'Schuylerville-Throgs Neck-Edgewater Park' = 148, 'Seagate-Coney Island' = 149, 'Sheepshead Bay-Gerritsen Beach-Manhattan Beach' = 150, 'SoHo-TriBeCa-Civic Center-Little Italy' = 151, 'Soundview-Bruckner' = 152, 'Soundview-Castle Hill-Clason Point-Harding Park' = 153, 'South Jamaica' = 154, 'South Ozone Park' = 155, 'Springfield Gardens North' = 156, 'Springfield Gardens South-Brookville' = 157, 'Spuyten Duyvil-Kingsbridge' = 158, 'St. Albans' = 159, 'Stapleton-Rosebank' = 160, 'Starrett City' = 161, 'Steinway' = 162, 'Stuyvesant Heights' = 163, 'Stuyvesant Town-Cooper Village' = 164, 'Sunset Park East' = 165, 'Sunset Park West' = 166, 'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill' = 167, 'Turtle Bay-East Midtown' = 168, 'University Heights-Morris Heights' = 169, 'Upper East Side-Carnegie Hill' = 170, 'Upper West Side' = 171, 'Van Cortlandt Village' = 172, 'Van Nest-Morris Park-Westchester Square' = 173, 'Washington Heights North' = 174, 'Washington Heights South' = 175, 'West Brighton' = 176, 'West Concourse' = 177, 'West Farms-Bronx River' = 178, 'West New Brighton-New Brighton-St. George' = 179, 'West Village' = 180, 'Westchester-Unionport' = 181, 'Westerleigh' = 182, 'Whitestone' = 183, 'Williamsbridge-Olinville' = 184, 'Williamsburg' = 185, 'Windsor Terrace' = 186, 'Woodhaven' = 187, 'Woodlawn-Wakefield' = 188, 'Woodside' = 189, 'Yorkville' = 190, 'park-cemetery-etc-Bronx' = 191, 'park-cemetery-etc-Brooklyn' = 192, 'park-cemetery-etc-Manhattan' = 193, 'park-cemetery-etc-Queens' = 194, 'park-cemetery-etc-Staten Island' = 195)) AS dropoff_ntaname,

    toUInt16(ifNull(dropoff_puma, '0')) AS dropoff_puma

    FROM trips

    This takes 3,030 seconds, at a rate of about 428,000 rows per second. To load the data faster, you can create the table with the Log engine instead of MergeTree; in that case, the load takes a little over 200 seconds.
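
    A minimal sketch of that alternative (the table name trips_log is hypothetical): clone the schema of the MergeTree table created above and override only the engine, so that inserts skip sorting and background merges.

    CREATE TABLE trips_log AS trips_mergetree
    ENGINE = Log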

    This table uses 126 GB of disk space.

  • SELECT formatReadableSize(sum(bytes)) FROM system.parts WHERE table = 'trips_mergetree' AND active

    ┌─formatReadableSize(sum(bytes))─┐
    │ 126.18 GiB                     │
    └────────────────────────────────┘
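
    If you also want to see how well the data compresses, system.parts exposes compressed and uncompressed sizes as well; a sketch (column names as documented for system.parts):

    SELECT
        formatReadableSize(sum(data_compressed_bytes)) AS compressed,
        formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
    FROM system.parts
    WHERE table = 'trips_mergetree' AND active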

    In addition, you can run the OPTIMIZE query on the MergeTree table. But this is not required: performance remains good even without it.
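
    The optimization pass itself is a single statement; the FINAL modifier forces the data parts to be merged down (to one part per partition):

    OPTIMIZE TABLE trips_mergetree FINAL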

    Download of prepared partitions

    $ curl -O https://datasets.clickhouse.tech/trips_mergetree/partitions/trips_mergetree.tar
    $ tar xvf trips_mergetree.tar -C /var/lib/clickhouse # path to ClickHouse data directory
    $ # check permissions of unpacked data, fix if required
    $ sudo service clickhouse-server restart
    $ clickhouse-client --query "select count(*) from datasets.trips_mergetree"
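
    If the server cannot read the unpacked files, fix their ownership. Assuming a default package installation, where the server runs as the clickhouse system user, something like the following would do (the exact user and path may differ on your setup):

    $ sudo chown -R clickhouse:clickhouse /var/lib/clickhouse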

    Results on a single server

    Q1:

    SELECT cab_type, count(*) FROM trips_mergetree GROUP BY cab_type

    0.490 seconds.

    Q2:

    SELECT passenger_count, avg(total_amount) FROM trips_mergetree GROUP BY passenger_count

    1.224 seconds.

    Q3:

    SELECT passenger_count, toYear(pickup_date) AS year, count(*) FROM trips_mergetree GROUP BY passenger_count, year

    2.104 seconds.

    Q4:

    SELECT passenger_count, toYear(pickup_date) AS year, round(trip_distance) AS distance, count(*)
    FROM trips_mergetree
    GROUP BY passenger_count, year, distance
    ORDER BY year, count(*) DESC

    3.593 seconds.

    The following server was used to run the queries:

    Two Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 16 physical cores in total, 128 GiB RAM, 8×6 TB HDDs in a RAID-5 array.

    The execution time is the best of three runs; starting from the second run, queries read data from the file system cache. No further caching takes place: the data is read and processed anew on every run.
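
    To reproduce this kind of measurement yourself, one simple approach (a sketch, not the exact harness used here) is to repeat each query a few times with client-side timing enabled and keep the best result; --time prints the elapsed time to stderr, so discarding stdout keeps only the timings visible:

    $ for i in 1 2 3; do clickhouse-client --time --query "SELECT cab_type, count(*) FROM trips_mergetree GROUP BY cab_type" > /dev/null; done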

    Creating the table on a cluster of three servers:

    On each server, run:

    Info: to run the SQL queries described below, you must use the fully qualified table name, datasets.trips_mergetree.
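
    For example, the row-count check from above then becomes:

    SELECT count(*) FROM datasets.trips_mergetree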

  • CREATE TABLE default.trips_mergetree_third ( trip_id UInt32, vendor_id Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14), pickup_date Date, pickup_datetime DateTime, dropoff_date Date, dropoff_datetime DateTime, store_and_fwd_flag UInt8, rate_code_id UInt8, pickup_longitude Float64, pickup_latitude Float64, dropoff_longitude Float64, dropoff_latitude Float64, passenger_count UInt8, trip_distance Float64, fare_amount Float32, extra Float32, mta_tax Float32, tip_amount Float32, tolls_amount Float32, ehail_fee Float32, improvement_surcharge Float32, total_amount Float32, payment_type_ Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4), trip_type UInt8, pickup FixedString(25), dropoff FixedString(25), cab_type Enum8('yellow' = 1, 'green' = 2, 'uber' = 3), pickup_nyct2010_gid UInt8, pickup_ctlabel Float32, pickup_borocode UInt8, pickup_boroname Enum8('' = 0, 'Manhattan' = 1, 'Bronx' = 2, 'Brooklyn' = 3, 'Queens' = 4, 'Staten Island' = 5), pickup_ct2010 FixedString(6), pickup_boroct2010 FixedString(7), pickup_cdeligibil Enum8(' ' = 0, 'E' = 1, 'I' = 2), pickup_ntacode FixedString(4), pickup_ntaname Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem North-Polo Grounds' = 30, 'Central Harlem South' = 31, 'Charleston-Richmond Valley-Tottenville' = 32, 'Chinatown' = 33, 'Claremont-Bathgate' = 34, 'Clinton' = 35, 'Clinton Hill' = 36, 'Co-op City' = 37, 'College Point' = 38, 'Corona' = 39, 'Crotona Park East' = 40, 'Crown Heights North' = 41, 'Crown Heights South' = 42, 'Cypress Hills-City Line' = 43, 'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill' = 44, 'Douglas Manor-Douglaston-Little Neck' = 45, 'Dyker Heights' = 46, 'East Concourse-Concourse Village' = 47, 'East Elmhurst' = 48, 'East Flatbush-Farragut' = 49, 'East Flushing' = 50, 'East Harlem North' = 51, 'East Harlem South' = 52, 'East New York' = 53, 'East New York (Pennsylvania Ave)' = 54, 'East Tremont' = 55, 'East Village' = 56, 'East Williamsburg' = 57, 'Eastchester-Edenwald-Baychester' = 58, 'Elmhurst' = 59, 'Elmhurst-Maspeth' = 60, 'Erasmus' = 61, 'Far Rockaway-Bayswater' = 62, 'Flatbush' = 63, 'Flatlands' = 64, 'Flushing' = 65, 'Fordham South' = 66, 'Forest Hills' = 67, 'Fort Greene' = 68, 'Fresh Meadows-Utopia' = 69, 'Ft. Totten-Bay Terrace-Clearview' = 70, 'Georgetown-Marine Park-Bergen Beach-Mill Basin' = 71, 'Glen Oaks-Floral Park-New Hyde Park' = 72, 'Glendale' = 73, 'Gramercy' = 74, 'Grasmere-Arrochar-Ft. Wadsworth' = 75, 'Gravesend' = 76, 'Great Kills' = 77, 'Greenpoint' = 78, 'Grymes Hill-Clifton-Fox Hills' = 79, 'Hamilton Heights' = 80, 'Hammels-Arverne-Edgemere' = 81, 'Highbridge' = 82, 'Hollis' = 83, 'Homecrest' = 84, 'Hudson Yards-Chelsea-Flatiron-Union Square' = 85, 'Hunters Point-Sunnyside-West Maspeth' = 86, 'Hunts Point' = 87, 'Jackson Heights' = 88, 'Jamaica' = 89, 'Jamaica Estates-Holliswood' = 90, 'Kensington-Ocean Parkway' = 91, 'Kew Gardens' = 92, 'Kew Gardens Hills' = 93, 'Kingsbridge Heights' = 94, 'Laurelton' = 95, 'Lenox Hill-Roosevelt Island' = 96, 'Lincoln Square' = 97, 'Lindenwood-Howard Beach' = 98, 'Longwood' = 99, 'Lower East Side' = 100, 'Madison' = 101, 'Manhattanville' = 102, 'Marble Hill-Inwood' = 103, 'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville' = 104, 'Maspeth' = 105, 'Melrose South-Mott Haven North' = 106, 'Middle Village' = 107, 'Midtown-Midtown South' = 108, 'Midwood' = 109, 'Morningside Heights' = 110, 'Morrisania-Melrose' = 111, 'Mott Haven-Port Morris' = 112, 'Mount Hope' = 113, 'Murray Hill' = 114, 'Murray Hill-Kips Bay' = 115, 'New Brighton-Silver Lake' = 116, 'New Dorp-Midland Beach' = 117, 'New Springville-Bloomfield-Travis' = 118, 'North Corona' = 119, 'North Riverdale-Fieldston-Riverdale' = 120, 'North Side-South Side' = 121, 'Norwood' = 122, 'Oakland Gardens' = 123, 'Oakwood-Oakwood Beach' = 124, 'Ocean Hill' = 125, 'Ocean Parkway South' = 126, 'Old Astoria' = 127, 'Old Town-Dongan Hills-South Beach' = 128, 'Ozone Park' = 129, 'Park Slope-Gowanus' = 130, 'Parkchester' = 131, 'Pelham Bay-Country Club-City Island' = 132, 'Pelham Parkway' = 133, 'Pomonok-Flushing Heights-Hillcrest' = 134, 'Port Richmond' = 135, 'Prospect Heights' = 136, 'Prospect Lefferts Gardens-Wingate' = 137, 'Queens Village' = 138, 'Queensboro Hill' = 139, 'Queensbridge-Ravenswood-Long Island City' = 140, 'Rego Park' = 141, 'Richmond Hill' = 142, 'Ridgewood' = 143, 'Rikers Island' = 144, 'Rosedale' = 145, 'Rossville-Woodrow' = 146, 'Rugby-Remsen Village' = 147, 'Schuylerville-Throgs Neck-Edgewater Park' = 148, 'Seagate-Coney Island' = 149, 'Sheepshead Bay-Gerritsen Beach-Manhattan Beach' = 150, 'SoHo-TriBeCa-Civic Center-Little Italy' = 151, 'Soundview-Bruckner' = 152, 'Soundview-Castle Hill-Clason Point-Harding Park' = 153, 'South Jamaica' = 154, 'South Ozone Park' = 155, 'Springfield Gardens North' = 156, 'Springfield Gardens South-Brookville' = 157, 'Spuyten Duyvil-Kingsbridge' = 158, 'St. Albans' = 159, 'Stapleton-Rosebank' = 160, 'Starrett City' = 161, 'Steinway' = 162, 'Stuyvesant Heights' = 163, 'Stuyvesant Town-Cooper Village' = 164, 'Sunset Park East' = 165, 'Sunset Park West' = 166, 'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill' = 167, 'Turtle Bay-East Midtown' = 168, 'University Heights-Morris Heights' = 169, 'Upper East Side-Carnegie Hill' = 170, 'Upper West Side' = 171, 'Van Cortlandt Village' = 172, 'Van Nest-Morris Park-Westchester Square' = 173, 'Washington Heights North' = 174, 'Washington Heights South' = 175, 'West Brighton' = 176, 'West Concourse' = 177, 'West Farms-Bronx River' = 178, 'West New Brighton-New Brighton-St. George' = 179, 'West Village' = 180, 'Westchester-Unionport' = 181, 'Westerleigh' = 182, 'Whitestone' = 183, 'Williamsbridge-Olinville' = 184, 'Williamsburg' = 185, 'Windsor Terrace' = 186, 'Woodhaven' = 187, 'Woodlawn-Wakefield' = 188, 'Woodside' = 189, 'Yorkville' = 190, 'park-cemetery-etc-Bronx' = 191, 'park-cemetery-etc-Brooklyn' = 192, 'park-cemetery-etc-Manhattan' = 193, 'park-cemetery-etc-Queens' = 194, 'park-cemetery-etc-Staten Island' = 195), pickup_puma UInt16, dropoff_nyct2010_gid UInt8, dropoff_ctlabel Float32, dropoff_borocode UInt8, dropoff_boroname Enum8('' = 0, 'Manhattan' = 1, 'Bronx' = 2, 'Brooklyn' = 3, 'Queens' = 4, 'Staten Island' = 5), dropoff_ct2010 FixedString(6), dropoff_boroct2