Sampling Rows using the SAMPLE Function.pdf

  • 8/14/2019 Sampling Rows using the SAMPLE Function.pdf

    4/19/13 Sampling Rows using the SAMPLE Function

    www.coffingdw.com/sql/tdsqlutp/sampling_rows_using_the_sample_function.htm

    Sampling Rows using the SAMPLE Function

    Compatibility: Teradata Extension

    The Sampling function (SAMPLE) permits a SELECT to randomly return rows from a Teradata database table. It allows the request to specify either an absolute number of rows or a percentage of rows to return. Additionally, it provides the ability to return rows from multiple samples.

    The syntax for the SAMPLE function:
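The syntax diagram did not survive in this copy. The outline below is a reconstruction from the clauses discussed on this page (WITH REPLACEMENT, RANDOMIZED ALLOCATION, WHEN), not the original figure, so treat it as a sketch. Square brackets mark optional parts, braces mark a choice, and the names in angle brackets are placeholders rather than literal SQL. A percentage is written as a decimal fraction such as .25.

```sql
SELECT <column-list>
FROM   <table-name>
SAMPLE [WITH REPLACEMENT] [RANDOMIZED ALLOCATION]
       { <row-count> [, <row-count> ...]
       | <fraction>  [, <fraction> ...]
       | WHEN <condition> THEN <sizes> [ELSE <sizes>] END } ;
```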

    The next SELECT uses SAMPLE to get a random sample of the student course table:
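The statement itself is missing from this copy. A minimal sketch that would produce output of this shape is shown here; the table name Student_Course_table is hypothetical, and the row count of 5 is inferred from the "5 Rows Returned" line rather than taken from the original.

```sql
-- Hypothetical table name; 5 is inferred from the row count in the output.
SELECT *
FROM   Student_Course_table
SAMPLE 5 ;
```

Because the rows are chosen at random, each run can return a different set of five rows.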

    5 Rows Returned

    Student_ID Course_ID

    280023 210

    260000 400

    125634 100

    125634 220

    333450 500

    Sometimes, a single sampling of the data is not sufficient. The SAMPLE function can be used to request more than one sample by listing the number of rows or the percentage of rows to be returned for each sample.

    The next SELECT uses the SAMPLE function to request multiple samples:
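The statement is missing from this copy. Since the surrounding text describes two samples of .25 each, a request along the following lines is a reasonable guess; the table name Student_Course_table is hypothetical.

```sql
-- Hypothetical table name; two samples of 25% each, separated by a comma.
SELECT *
FROM   Student_Course_table
SAMPLE .25, .25 ;
```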

    8 Rows Returned

    Student_ID Course_ID

    123250 100

    125634 100

    125634 220

    231222 220

    260000 400

    280023 210

    322133 300

    333450 500

    Although multiple samples were taken, the rows came back as a single answer set consisting of 50% (.25 + .25) of the data. When it is necessary to determine which rows came from which sample, the SAMPLEID column name can be used to distinguish between each sample.

    This SELECT uses the SAMPLE function with the SAMPLEID to request multiple samples and denote which sample each row came from:
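The statement is missing from this copy. The sketch below assumes a hypothetical table name (Student_Course_table) and three samples of five rows each, which is consistent with the output (three sample IDs) and with the later remark that the request asked for more rows than the table holds. The ORDER BY is also an assumption, added because the output appears sorted by sample.

```sql
-- Hypothetical table name; sample sizes and ORDER BY are assumptions.
SELECT Student_ID, Course_ID, SAMPLEID
FROM   Student_Course_table
SAMPLE 5, 5, 5
ORDER BY SAMPLEID, Student_ID, Course_ID ;
```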

    14 Rows Returned

    Student_ID Course_ID SAMPLEID

    125634 100 1

    125634 220 1

    260000 400 1

    280023 210 1

    333450 500 1

    123250 100 2

    125634 200 2

    231222 220 2

    322133 220 2

    322133 300 2

    231222 210 3

    234121 100 3

    324652 200 3

    333450 400 3

    Since the previous request asks for more rows than are currently in the table, a warning message 7473 is received. Regardless, it is only a warning and the SELECT works and all rows are returned. If there is any doubt in the number of rows, instead of using a fixed number and receiving the warning message, the use of percentage is a better choice.

    At other times, you may want every row to remain available to every sample, so that the same row can be selected more than once. The next SELECT uses the SAMPLE WITH REPLACEMENT function with the SAMPLEID to request multiple samples and denote which sample each row came from:
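The statement is missing from this copy. A sketch under the same assumptions as before (hypothetical table name Student_Course_table, three samples of five rows inferred from the 15 rows of output, and an assumed ORDER BY):

```sql
-- Hypothetical table name; sample sizes and ORDER BY are assumptions.
-- WITH REPLACEMENT lets a row be picked again after it has been sampled.
SELECT Student_ID, Course_ID, SAMPLEID
FROM   Student_Course_table
SAMPLE WITH REPLACEMENT 5, 5, 5
ORDER BY SAMPLEID, Student_ID, Course_ID ;
```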

    15 Rows Returned

    Student_ID Course_ID SAMPLEID

    125634 100 1

    125634 220 1

    260000 400 1

    260000 400 1

    333450 500 1

    123250 100 2

    125634 200 2

    231222 220 2

    322133 220 2

    322133 300 2

    231222 210 3

    231222 220 3

    234121 100 3

    324652 200 3

    333450 400 3

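    In the original page, the duplicated rows were shown in bold: one row (260000 400) came back twice within sample 1, and another (231222 220) came back in two different samples (2 and 3). This yields 15 rows even though the table contains only 14.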

    The next SELECT uses the SAMPLE function with the SAMPLEID to request multiple samples as a percentage and denotes which sample each row came from:
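The statement is missing from this copy. The text that follows speaks of 10% per sample and the output shows four sample IDs, so a request like this one is a plausible reading; the table name Student_Course_table is hypothetical.

```sql
-- Hypothetical table name; four samples of 10% each.
SELECT Student_ID, Course_ID, SAMPLEID
FROM   Student_Course_table
SAMPLE .1, .1, .1, .1
ORDER BY SAMPLEID ;
```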

    4 Rows Returned

    Student_ID Course_ID SAMPLEID

    322133 300 1

    333450 500 2

    280023 210 3

    231222 210 4

    Although 10% of 14 rows is 1.4, only whole rows can be returned; therefore, 1 row is returned per sample. Also, since SAMPLEID is a column, it can be used as the sort key.

    By default, the SAMPLE function samples proportionally across all AMPs in the system: each AMP contributes rows in proportion to the number of rows it holds. Therefore, it is not a simple random sample across the entire population of rows. If you want a simple random sample, use RANDOMIZED ALLOCATION as seen below:
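The statement is missing from this copy. A sketch of the same four-sample request with RANDOMIZED ALLOCATION added is given here; the table name Student_Course_table is hypothetical, and the sample sizes are assumed to match the previous example.

```sql
-- Hypothetical table name; sample sizes assumed to match the prior request.
SELECT Student_ID, Course_ID, SAMPLEID
FROM   Student_Course_table
SAMPLE RANDOMIZED ALLOCATION .1, .1, .1, .1 ;
```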

    4 Rows Returned

    Student_ID Course_ID SAMPLEID

    333450 500 3

    324652 200 2

    333450 400 4

    260000 400 1

    Both the WITH REPLACEMENT and RANDOMIZED ALLOCATION can be used in the same SAMPLE. The OLAP functions provide some very interesting and powerful functionality for examining and evaluating data. They provide an insight into the data that was not easily obtained prior to these functions.

    Another functionality built into the SAMPLE is the conditional data test using the WHEN:
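The statement is missing from this copy. The sketch below shows the general shape of a conditional sample. The table name Student_Course_table, the condition on Course_ID, and the sample sizes are all assumptions, chosen only to be consistent with the four sample IDs in the output.

```sql
-- Hypothetical table name, condition, and sizes.
-- Rows meeting the WHEN condition feed the first two samples;
-- all other rows feed the samples listed after ELSE.
SELECT Student_ID, Course_ID, SAMPLEID
FROM   Student_Course_table
SAMPLE WHEN Course_ID > 300 THEN .1, .1
       ELSE .1, .1
       END ;
```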

    4 Rows Returned

    Student_ID Course_ID SAMPLEID

    333450 500 1

    333450 400 2

    324652 200 3

    123250 100 4

    Although these functions look like aggregates, they are not normally compatible with aggregates in the same SELECT list. As the next example demonstrates, aggregation can still be performed; however, the sampled rows must first be placed in a temporary or derived table.

    This next SELECT uses the SAMPLE function to request multiple samples to create a derived table (covered later). Then, the unique rows will be counted to show the random quality of the SAMPLE function:
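The statement is missing from this copy. A sketch of the pattern: the sample is taken inside a derived table and the aggregate is applied to its result. The table name Student_Course_table, the alias DT, and the sample sizes are assumptions.

```sql
-- Hypothetical table name, alias, and sample sizes.
-- The inner SELECT draws the samples; the outer SELECT aggregates them.
SELECT COUNT(DISTINCT(Course_ID))
FROM  (SELECT Course_ID
       FROM   Student_Course_table
       SAMPLE .25, .25) AS DT ;
```

Because each run draws different rows, the count of distinct course IDs can change from one execution to the next.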

    1 Row Returned

    count(distinct(course_id))

    4

    A second run of the same SELECT might very well yield these results:

    count(distinct(course_id))

    5