Upload
brunhild-welder
View
108
Download
0
Embed Size (px)
Citation preview
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Two Issues on Remote Data Access
- Automation of Disclosure Limitation Review
- Memory Allocation
Peter Jacobebbinghaus
5th German Stata User's Group Meeting
April 2nd, 2007 at RWI-Essen
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Research Data Centre of the BA in the IAB in Nuremberg (FDZ)
Provides access to micro data of the
Institute for Employment Research (IAB)
Federal Employment Agency (BA)
Data
surveys
administrational (databases > 200 GB, samples ~ 2 GB)
Data protection and data access
anonymization (scientific use files)
on-site use
remote data access
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Remote data access
Researchers send do-files – we return log-files
90% use Stata
Disclosure limitation review
Data have detailed information on persons and firms
Neither firms nor persons may be identifiable by the contents of the log-files
One heuristic: delete all information that is based on few observations
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Example
. table sex edu
----------------------------------
| Education
Sex | low medium high
----------+-----------------------
male | 1 6 4
female | 8 53 26
----------------------------------
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Example
. table sex edu
----------------------------------
| Education
Sex | low medium high
----------+-----------------------
male | / 6 4
female | 8 53 26
----------------------------------
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Example
. table sex edu, c(freq mean wage)
-------------------------------------
| Education
Sex | low medium high
----------+--------------------------
male | 1 6 4
| 137.973 137.204 117.908
|
female | 8 53 26
| 125.450 140.083 134.114
-------------------------------------
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Example
. table sex edu, c(freq mean wage)
-------------------------------------
| Education
Sex | low medium high
----------+--------------------------
male | / 6 4
| / 137.204 117.908
|
female | 8 53 26
| 125.450 140.083 134.114
-------------------------------------
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Disclosure limitation review
Problems
a lot of work if done by hand!
Pearl scripts don‘t get all
• number of firms not visible
• weights
Solutions
use other software ;-)
„fdz-ados“
• -tablefdz-
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
tablefdz
sxntax very similar to -table-
program define tablefdz, byable(recall)
version 9
syntax varlist(min=1 max=2) [if] [in] [pweight /] ,FIRMID(string) [Contents(string) Format(string)]
error messages
preserve
drop all needless variables and observations
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
by `vars' `firmid': gen `dum' = ( _n == 1 )
by `vars': egen `anzbetr' = sum(`dum')
by `vars': egen `summe' = sum( `eins' )
forv i = 1/`anz_stats' {
if "`weight'" != "" & "`stat`i''" == "mean" {
by `vars': egen `sum_gew' = sum( `svar`i'' )
by `vars': gen `sv`i'' = `sum_gew' / `summe'
}
else {
by `vars': egen `sv`i'' = `stat`i''( `svar`i'' )
}
}
by `vars': keep if _n == 1
forvalues i = 1/`anz_stats' {
replace `sv`i'' = -2 if `anzbetr' <= `minoffirms'
}
replace `summe' = -2 if `anzbetr' <= `minoffirms'
save `cell'
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
tablefdz
calculation done for inner table, row, col and total
results for inner table, row, col and total are appended
tabdisp
restore
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Results of tablefdz
. tablefdz sex edu, firmid(firmid)
------------------------------------------
| Education
Sex | low medium high Total
----------+-------------------------------
male | -1 -1 4 13
female | 8 53 26 87
|
Total | 11 59 30 100
------------------------------------------
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
. tablefdz sex edu, firmid(firmid) c(mean wage sum wage)
--------------------------------------------------
| Education
Sex | low medium high Total
----------+---------------------------------------
male | -1 -1 4 13
| -1 -1 117.9078 131.4443
| -1 -1 471.6311 1708.776
|
female | 8 53 26 87
| 125.4497 140.0829 134.1142 136.9536
| 1003.598 7424.396 3486.968 11914.96
|
Total | 11 59 30 100
| 128.8651 139.7902 131.9533 136.2374
| 1417.516 8247.622 3958.599 13623.74
--------------------------------------------------
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
. tablefdz sex edu [pw= weight], c(mean wage sum wage) firmid(firmid)
--------------------------------------------------
| Education
Sex | low medium high Total
----------+---------------------------------------
male | -1 -1 115.1209 431.6714
| -1 -1 115.558 131.4435
| -1 -1 13303.14 56740.38
|
female | 349.4451 2696.87 1252.799 4299.114
| 133.8411 133.9306 134.1103 133.9757
| 46770.11 361193.5 168013.3 575976.9
|
Total | 497.2254 2865.64 1367.92 4730.786
| 133.3649 134.3813 132.549 133.7446
| 66312.39 385088.5 181316.4 632717.3
--------------------------------------------------
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
tablefdz
deleted values can somtimes be recalculated very easily
more deletion and
rounding by Pearl still necessary but now it is feasible
tablefdz takes about twice the time as table
Questions
ideas to improve? (computation time)
similar programs around?
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
Topic 2: Memory usage
16 gigabyte memory in Unix server, 15 gigabyte are free
How much memory is the optimum we should allocate?
13? 15?
dependant on matsize?
rule-of-thumb?
Automation for running Stata-jobs?
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
* Program to produce tables with automatically deleted values that are based on few observations
* by Stefan Seth and Peter Jacobebbinghaus
*! version 1.0.1 30mar2007
program define tablefdz, byable(recall)
version 9
syntax varlist(min=1 max=2) [if] [in] [pweight /] , FIRMID(string) [Contents(string) Format(string)]
tempvar anzbetr sv1 sv2 sv3 sv4 sv5 dum svar1 svar2 svar3 svar4 svar5 eins summe
tempvar sum_gew1 sum_gew2 sum_gew3 sum_gew4 sum_gew5 // used when calculating weighted averages
tempfile dat cross row col scol
local minoffirms = 3 // min. number of firms per statistic
local anzvar = wordcount("`varlist'")
local fmt "format(`format')"
preserve
if _by() {
marksample touse // take care of by-group
}
quietly {
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
if "`contents'" == "" {
local anz_stats = 0
}
if "`contents'" != "" {
if mod( wordcount( "`contents'" ), 2 ) != 0 {
di in red "Invalid statistics specification; please use the Contents option as follows:"
di in red "contents( statistic variable )"
di in red "You may specify up to 5 statistic/variable-pairs, e.g.: "
di in red "tabfdz var1 var2, contents( mean var3 sum var4 )
exit
}
tokenize "`contents'"
local stats "`1' `3' `5' `7' `9'"
local svar "`2' `4' `6' `8' `10'"
tokenize "`stats'"
local anz_stats = wordcount( "`stats'" ) // list of statistics
forvalues i = 1/`anz_stats' {
local stat`i' "``i''"
}
tokenize "`svar'" // list of variables
forvalues i = 1/`anz_stats' {
local svar`i' "``i''"
}
}
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
*------------------------------------------------------------ error messages
if "`weight'" != "" {
forv i = 1/`anz_stats' {
if inlist( "`stat`i''", "sum", "mean" ) == 0 {
di in red "weights may only be used with 'sum' and 'mean'"
exit
}
}
if "`weight'" != "pweight" {
di in red "only pweights allowed"
exit
}
}
forvalues i = 1/`anz_stats' {
if inlist( "`stat`i''", "sum", "mean", "sd", "median", "min", "max" ) == 0 {
di in red "Statistics sum mean sd median min max allowed only"
exit
}
}
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
*------------------------------------------------------------ reduce size of data
if _by() {
keep if `touse' // restrict to by-group
}
capture keep `if' `in' // reduction of observations
keep `firmid' `varlist' `svar' `exp' // reduction of variables
gen byte `eins' = 1
tokenize `varlist' // dropping missings
drop if missing( `1' )
if `anzvar' == 2 {
drop if missing( `2' )
}
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
if "`weight'" != "" {
replace `eins' = `exp' // multiply statistics variables with weights...
forv i = 1 / `anz_stats' {
local jmax = `i' - 1
local seen = 0 // ... but only once
forv j = 1 / `jmax' {
if "`svar`i''" == "`svar`j''" {
local seen = 1
}
}
if `seen' == 0 {
replace `svar`i'' = `svar`i'' * `exp'
}
}
}
save `dat'
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
*------------------------------------------------------------ computation of statistics
local vars "`varlist'" // inner part of the table
sort `vars' `firmid'
by `vars' `firmid': gen `dum' = ( _n == 1 ) // mark obs with new firm
by `vars': egen `anzbetr' = sum(`dum') // number of firms per cell
by `vars': egen `summe' = sum( `eins' ) // (weighted) number of persons per cell
forv i = 1 / `anz_stats' {
if "`weight'" != "" & "`stat`i''" == "mean" { // mean with weights per hand
by `vars': egen `sum_gew`i'' = sum( `svar`i'' )
by `vars': gen `sv`i'' = `sum_gew`i''/`summe' // mean = sum(y_i * h_i)/sum(h_i)
}
else {
by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) // other statistics per person
}
}
by `vars': keep if _n == 1 // keep one line per cell
forv i = 1 / `anz_stats' {
replace `sv`i'' = -1 if `anzbetr' <= `minoffirms' // erase statistics based on few firms
}
replace `summe' = -1 if `anzbetr' <= `minoffirms' // erase number of cases based on few firms
save `cross' // save results for the inner part
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
if `anzvar' == 2 {
use `dat', clear // the same for rows
tokenize `varlist'
local vars "`1'"
replace `2' = .
sort `vars' `firmid'
by `vars' `firmid': gen `dum' = ( _n == 1 )
by `vars': egen `anzbetr' = sum(`dum')
by `vars': egen `summe' = sum( `eins' )
forv i = 1/`anz_stats' {
if "`weight'" != "" & "`stat`i''" == "mean" {
by `vars': egen `sum_gew`i'' = sum( `svar`i'' )
by `vars': gen `sv`i'' = `sum_gew`i'' / `summe'
}
else {
by `vars': egen `sv`i'' = `stat`i''( `svar`i'' )
}
}
by `vars': keep if _n == 1
forvalues i = 1/`anz_stats' {
replace `sv`i'' = -2 if `anzbetr' <= `minoffirms'
}
replace `summe' = -2 if `anzbetr' <= `minoffirms'
save `row'
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
use `dat', clear // the same for columns
tokenize `varlist'
replace `1' = .
local vars "`2'"
sort `vars' `firmid'
by `vars' `firmid': gen `dum' = ( _n == 1 )
by `vars': egen `anzbetr' = sum( `dum' )
by `vars': egen `summe' = sum( `eins' )
forv i = 1/`anz_stats' {
if "`weight'" != "" & "`stat`i''" == "mean" {
by `vars': egen `sum_gew`i'' = sum( `svar`i'' )
by `vars': gen `sv`i'' = `sum_gew`i'' / `summe'
}
else {
by `vars': egen `sv`i'' = `stat`i''( `svar`i'' )
}
}
by `vars': keep if _n == 1
forvalues i = 1/`anz_stats' {
replace `sv`i'' = -3 if `anzbetr' <= `minoffirms'
}
replace `summe' = -3 if `anzbetr' <= `minoffirms'
save `col'
}
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
use `dat', clear // the same for the total
tokenize `varlist'
replace `1' = .
if `anzvar' == 2 {
replace `2' = .
}
sort `firmid'
by `firmid': gen `dum' = ( _n == 1 )
egen `anzbetr' = sum(`dum')
egen `summe' = sum( `eins' )
forv i = 1/`anz_stats' {
if "`weight'" != "" & "`stat`i''" == "mean" {
egen `sum_gew`i'' = sum( `svar`i'' )
gen `sv`i'' = `sum_gew`i'' / `summe'
}
else {
egen `sv`i'' = `stat`i''( `svar`i'' )
}
}
keep if _n == 1
forvalues i = 1/`anz_stats' {
replace `sv`i'' = -4 if `anzbetr' <= `minoffirms'
}
replace `summe' = -4 if `anzbetr' <= `minoffirms'
save `scol'
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
use `cross', clear // appending the results
append using `scol'
if `anzvar' == 2 {
append using `col'
append using `row'
}
*------------------------------------------------------------ format output
forvalues i = 1/`anz_stats' { // headers
local svlist = "`svlist'" + " `" + "sv`i'" + "'"
}
if `anzvar' == 1 & "`contents'" != "" {
label variable `sv1' "`stat1'(`svar1')"
cap label variable `sv2' "`stat2'(`svar2')"
cap label variable `sv3' "`stat3'(`svar3')"
cap label variable `sv4' "`stat4'(`svar4')"
cap label variable `sv5' "`stat5'(`svar5')"
}
label variable `summe' "Freq."
local linesize = c(linesize)
set linesize 255
format `summe' %9.0g
} // end of quietly
http://fdz.iab.de
Forschungsdatenzentrum der Bundesagentur für Arbeitim Institut für Arbeitsmarkt-und Berufsforschung
*------------------------------------------------------------ output
di "/*§BEGIN_tabfdz*/"
tabdisp `varlist', cellvar(`summe' `svlist' ) totals `fmt' csepwidth(2) missing
di "/*§END_tabfdz*/"
restore
set linesize `linesize'
end