|
| 1 | +ivory |
| 2 | +===== |
| 3 | + |
| 4 | +``` |
| 5 | +ivory: (the ivories) the keys of a piano |
| 6 | +``` |
| 7 | + |
| 8 | +Overview |
| 9 | +-------- |
| 10 | + |
| 11 | +*ivory* defines a specification for how to store feature data and provides a set of tools for |
| 12 | +querying it. It does not provide any tooling for producing feature data in the first place. |
| 13 | +All ivory commands run as MapReduce jobs, so it is assumed that feature data is maintained on HDFS. |
| 14 | + |
| 15 | + |
| 16 | +Repository |
| 17 | +---------- |
| 18 | + |
| 19 | +The tooling provided operates on an *ivory repository*. An ivory repository is a convention for |
| 20 | +storing *fact sets*, feature *dictionaries* and *feature stores*. The directory structure is |
| 21 | +as follows: |
| 22 | + |
| 23 | +``` |
| 24 | +ivory_repository/ |
| 25 | +├── metadata |
| 26 | +│ ├── dictionaries |
| 27 | +│ │ ├── dictionary1 |
| 28 | +│ │ └── dictionary2 |
| 29 | +│ └── stores |
| 30 | +│ ├── feature_store1 |
| 31 | +│ └── feature_store2 |
| 32 | +└── fact_sets |
| 33 | + ├── fact_set1 |
| 34 | + └── fact_set2 |
| 35 | +``` |
| 36 | + |
| 37 | + |
| 38 | +Fact Sets |
| 39 | +--------- |
| 40 | + |
| 41 | +A *fact set* is a single directory containing multiple *facts*, where a *fact* defines: |
| 42 | + |
| 43 | +1. The *entity* the feature value is associated with; |
| 44 | +2. An *attribute* specifying which feature; |
| 45 | +3. The *value* itself; |
| 46 | +4. The *time* from which the feature value is valid. |
| 47 | + |
| 48 | +That is, a fact is simply an *EAVT* record. Multiple facts form a fact set, which is stored as |
| 49 | +*EAVT* files that are partitioned by *namespace* and *date*. For example: |
| 50 | + |
| 51 | +``` |
| 52 | +my_fact_set/ |
| 53 | +├── widgets |
| 54 | +│ └── 2014 |
| 55 | +│ └── 01 |
| 56 | +│ ├── 09 |
| 57 | +│ │ ├── eavt-00000 |
| 58 | +│ │ ├── eavt-00001 |
| 59 | +│ │ └── eavt-00002 |
| 60 | +│ ├── 10 |
| 61 | +│ │ ├── eavt-00000 |
| 62 | +│ │ ├── eavt-00001 |
| 63 | +│ │ └── eavt-00002 |
| 64 | +│ └── 11 |
| 65 | +│ ├── eavt-00000 |
| 66 | +│ ├── eavt-00001 |
| 67 | +│ └── eavt-00002 |
| 68 | +└── demo |
| 69 | + └── 2013 |
| 70 | + └── 01 |
| 71 | + └── 09 |
| 72 | + ├── eavt-00000 |
| 73 | + └── eavt-00001 |
| 74 | + |
| 75 | +``` |
| 76 | + |
| 77 | +In this fact set, facts are partitioned across two namespaces: `widgets` and `demo`. The *widget* facts |
| 78 | +are spread across three dates, while *demographic* facts are constrained to one. Note also that |
| 79 | +a given namespace-partition can contain multiple EAVT files. |
| 80 | + |
| 81 | +EAVT files are simply pipe-delimited text files with one EAVT record per line. For example, a line in |
| 82 | +the file `my_fact_set/widgets/2014/01/10/eavt-00001` might look like: |
| 83 | + |
| 84 | +``` |
| 85 | +928340|inbound.count.1W|35|43200 |
| 86 | +``` |
| 87 | + |
| 88 | +That is, the fact: "feature attribute `inbound.count.1W` has value `35` for entity `928340` as of |
| 89 | +12:00 on 10 January 2014". The time component of the record is the number of seconds into the day specified by |
| 90 | +the partition the record belongs to (here, 43200 seconds is 12 hours). Note that Ivory does not enforce or specify a time zone for the time component |
| 91 | +of a fact. A time zone should be chosen that reflects the domain; however, for a given Ivory |
| 92 | +feature store, the time zone for all facts should be the same. |
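
As an illustration of how the partition path and the record combine, here is a minimal Python sketch that turns one EAVT line into a structured fact. It is not part of Ivory: the `Fact` type and `parse_eavt_line` function are hypothetical names, and the path layout is assumed to be `<fact_set>/<namespace>/<yyyy>/<mm>/<dd>/<file>` as in the listing above.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Fact(NamedTuple):
    """Hypothetical in-memory form of one EAVT record."""
    entity: str
    namespace: str
    attribute: str
    value: str       # kept as text; the dictionary decides the encoding
    time: datetime   # naive datetime, since Ivory does not specify a time zone


def parse_eavt_line(path: str, line: str) -> Fact:
    """Parse one pipe-delimited EAVT record using its partition path."""
    # Assumed layout: <fact_set>/<namespace>/<yyyy>/<mm>/<dd>/<file>
    parts = path.strip("/").split("/")
    namespace, year, month, day = parts[-5], parts[-4], parts[-3], parts[-2]
    entity, attribute, value, seconds = line.rstrip("\n").split("|")
    # The time field counts seconds from midnight of the partition date.
    midnight = datetime(int(year), int(month), int(day))
    return Fact(entity, namespace, attribute, value,
                midnight + timedelta(seconds=int(seconds)))


fact = parse_eavt_line("my_fact_set/widgets/2014/01/10/eavt-00001",
                       "928340|inbound.count.1W|35|43200")
print(fact.time.isoformat())  # 2014-01-10T12:00:00
```

The value is left as a string on purpose, because its type is only known once the feature dictionary is consulted.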
| 93 | + |
| 94 | + |
| 95 | +Feature store |
| 96 | +------------- |
| 97 | + |
| 98 | +A feature store comprises one or more *fact sets* and is represented by a text file containing |
| 99 | +an ordered list of references to fact sets. For example: |
| 100 | + |
| 101 | +``` |
| 102 | +00005 |
| 103 | +00004 |
| 104 | +00003 |
| 105 | +widget_fixed |
| 106 | +00002 |
| 107 | +00001 |
| 108 | +00000 |
| 109 | +``` |
| 110 | + |
| 111 | +The ordering is important as it allows facts to be overridden. When a feature store is queried, if multiple facts |
| 112 | +with the same entity, attribute and time are identified, the value from the fact contained in the most recent fact |
| 113 | +set will be used, where most recent means listed higher in the feature store file. |
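
The override rule can be sketched in a few lines of Python. This is only an illustration of the ordering semantics, not Ivory's MapReduce implementation: `resolve` is a hypothetical helper, and fact sets are modelled as in-memory dictionaries keyed by (entity, attribute, time).

```python
from typing import Dict, List, Tuple

# (entity, attribute, time) -> value
FactKey = Tuple[str, str, str]


def resolve(store_order: List[str],
            fact_sets: Dict[str, Dict[FactKey, str]]) -> Dict[FactKey, str]:
    """Merge fact sets so that sets listed higher in the store file win."""
    resolved: Dict[FactKey, str] = {}
    # Walk from the bottom of the file to the top so that facts from
    # higher-listed fact sets overwrite those from lower-listed ones.
    for name in reversed(store_order):
        resolved.update(fact_sets.get(name, {}))
    return resolved


# Illustrative data only: "widget_fixed" is listed above "00000".
store_order = ["00001", "widget_fixed", "00000"]
fact_sets = {
    "00000": {("928340", "inbound.count.1W", "2014-01-10T12:00"): "35"},
    "widget_fixed": {("928340", "inbound.count.1W", "2014-01-10T12:00"): "36"},
    "00001": {("928341", "inbound.count.1W", "2014-01-10T12:00"): "7"},
}
merged = resolve(store_order, fact_sets)
```

Here the value `36` from `widget_fixed` replaces the value `35` from `00000`, because `widget_fixed` appears higher in the list.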
| 114 | + |
| 115 | +Because a feature store can be specified by just referencing fact sets, Ivory can support poor man's versioning, giving |
| 116 | +rise to use cases such as: |
| 117 | + |
| 118 | +* overriding buggy values with corrected ones; |
| 119 | +* combining *production* features with *ad-hoc* features. |
| 120 | + |
| 121 | + |
| 122 | +Dictionary |
| 123 | +---------- |
| 124 | + |
| 125 | +All features are identified by their name and namespace. In the example fact above, the feature is `widgets:inbound.count.1W` |
| 126 | +where `widgets` is the namespace and `inbound.count.1W` is the name. In Ivory, each namespace-name |
| 127 | +feature identifier must also be associated with the following metadata: |
| 128 | + |
| 129 | +* An *encoding* specifying the type encoding of a feature value: |
| 130 | + * `boolean` |
| 131 | + * `int` |
| 132 | + * `double` |
| 133 | + * `string` |
| 134 | + |
| 135 | +* A *classification* type specifying how a feature value can be semantically interpreted and used: |
| 136 | + * `numerical` |
| 137 | + * `categorical` |
| 138 | + |
| 139 | +* A human-readable *description*. |
| 140 | + |
| 141 | +In Ivory, feature metadata is separated from the feature store (facts) in its own set of text files known |
| 142 | +as *feature dictionaries*. Dictionary text files are also pipe-delimited and of the following form: |
| 143 | + |
| 144 | +``` |
| 145 | +namespace|name|encoding|type|description |
| 146 | +``` |
| 147 | + |
| 148 | +So for the fact above, we could specify a dictionary entry such as: |
| 149 | + |
| 150 | +``` |
| 151 | +widgets|inbound.count.1W|int|numerical|Count in the last week |
| 152 | +``` |
| 153 | + |
| 154 | +Other dictionary entries might look like the following: |
| 155 | + |
| 156 | +``` |
| 157 | +demo|gender|string|categorical|Gender |
| 158 | +demo|postcode|string|categorical|Postcode |
| 159 | +``` |
| 160 | + |
| 161 | +Given a dictionary, we can use Ivory to validate facts in a feature store. The `validate` command will |
| 162 | +check that the encoding types specified for features in the dictionary are consistent with facts: |
| 163 | + |
| 164 | +``` |
| 165 | +> ivory validate --feature-store feature_store.txt --dictionary feature_dictionary.txt |
| 166 | +``` |
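
To make the idea of an encoding check concrete, the following Python sketch loads dictionary lines and tests whether a fact value parses as its declared encoding. It is an assumption-laden illustration, not the logic of the `validate` command: `load_dictionary`, `is_consistent` and the accepted textual forms (for example `true`/`false` for `boolean`) are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple


def _is_int(text: str) -> bool:
    try:
        int(text)
        return True
    except ValueError:
        return False


def _is_double(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


# Assumed textual forms for each encoding; Ivory's real rules may differ.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "boolean": lambda text: text in ("true", "false"),
    "int": _is_int,
    "double": _is_double,
    "string": lambda text: True,
}


def load_dictionary(lines: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Map (namespace, name) to the declared encoding."""
    encodings = {}
    for line in lines:
        # Split on the first four pipes only, so the description may contain pipes.
        namespace, name, encoding, _type, _description = line.rstrip("\n").split("|", 4)
        encodings[(namespace, name)] = encoding
    return encodings


def is_consistent(encodings, namespace: str, name: str, value: str) -> bool:
    """True when the feature is declared and the value parses as its encoding."""
    encoding = encodings.get((namespace, name))
    return encoding in CHECKS and CHECKS[encoding](value)


encodings = load_dictionary([
    "widgets|inbound.count.1W|int|numerical|Count in the last week",
    "demo|gender|string|categorical|Gender",
])
ok = is_consistent(encodings, "widgets", "inbound.count.1W", "35")
bad = is_consistent(encodings, "widgets", "inbound.count.1W", "thirty-five")
```

In this sketch a feature that is missing from the dictionary is treated as inconsistent; whether Ivory reports it that way is not stated here.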
| 167 | + |
| 168 | +We can also use Ivory to generate statistics for the values of specific features across a feature store using the |
| 169 | +`inspect` command. This will compute statistics such as density, ranges (for numerical features), factors (for |
| 170 | +categorical features), histograms, means, etc. Inspections can filter both the features of interest and which |
| 171 | +facts are considered, by time: |
| 172 | + |
| 173 | +``` |
| 174 | +> ivory inspect --feature-store feature_store.txt --dictionary feature_dictionary.txt --features features.txt --start-time '2013-01-01' --end-time '2014-01-01' |
| 175 | +``` |
| 176 | + |
| 177 | + |
| 178 | +Querying |
| 179 | +-------- |
| 180 | + |
| 181 | +Ivory supports two types of queries: *snapshots* and *chords*. |
| 182 | + |
| 183 | + |
| 184 | +A `snapshot` query is used to extract the feature values for entities at a certain point in time. Snapshotting can filter |
| 185 | +the set of features and/or entities considered. By default the output is in *EAVT* format, but it can be output in |
| 186 | +row-oriented form (i.e. column per feature) using the `--pivot` option. When a `snapshot` query is performed, the most |
| 187 | +recent feature value for a given entity-attribute, with respect to the snapshot time, will be returned in the output: |
| 188 | + |
| 189 | +``` |
| 190 | +# get a snapshot of values for specific features and entities as of 1 Nov 2013 |
| 191 | +> ivory snapshot --feature-store feature_store.txt --dictionary feature_dictionary.txt --features features.txt --entities entities.txt --time '2013-11-01' --output nov2013snapshot |
| 192 | + |
| 193 | +# Pivot the table to be row oriented |
| 194 | +> ivory snapshot --pivot --feature-store feature_store.txt --dictionary feature_dictionary.txt --features features.txt --entities entities.txt --time '2013-11-01' --output nov2013snapshot |
| 195 | +``` |
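
The snapshot semantics can be sketched in Python as "keep the latest value per entity-attribute that is not later than the snapshot time". This is a simplified in-memory illustration rather than Ivory's implementation: the `snapshot` and `pivot` functions are hypothetical, and treating a fact stamped exactly at the snapshot time as included is an assumption.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

Fact = Tuple[str, str, datetime, str]  # entity, attribute, time, value


def snapshot(facts: Iterable[Fact],
             as_of: datetime) -> Dict[Tuple[str, str], Tuple[datetime, str]]:
    """Latest (time, value) per (entity, attribute) at or before as_of."""
    latest: Dict[Tuple[str, str], Tuple[datetime, str]] = {}
    for entity, attribute, time, value in facts:
        if time > as_of:
            continue  # facts after the snapshot time are ignored
        key = (entity, attribute)
        if key not in latest or time > latest[key][0]:
            latest[key] = (time, value)
    return latest


def pivot(latest) -> Dict[str, Dict[str, str]]:
    """Row-oriented view: one row per entity, one column per attribute."""
    rows: Dict[str, Dict[str, str]] = {}
    for (entity, attribute), (_time, value) in latest.items():
        rows.setdefault(entity, {})[attribute] = value
    return rows


# Illustrative facts only.
facts = [
    ("928340", "inbound.count.1W", datetime(2013, 10, 20), "30"),
    ("928340", "inbound.count.1W", datetime(2013, 10, 27), "35"),
    ("928340", "inbound.count.1W", datetime(2013, 11, 3), "40"),
    ("928316", "inbound.count.1W", datetime(2013, 10, 27), "12"),
]
rows = pivot(snapshot(facts, datetime(2013, 11, 1)))
```

For entity `928340` the value from 27 October is chosen: it is the most recent one before 1 November, and the 3 November fact is ignored.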
| 196 | + |
| 197 | +A `chord` query is used to extract the feature values for entities at different points in time, known as *instances*. That is, for each entity, a |
| 198 | +different time is specified. In fact, multiple times can be specified per entity. To invoke a chord query, a file of *instance |
| 199 | +descriptors*, which are entity-time pairs, must be specified. For example: |
| 200 | + |
| 201 | +``` |
| 202 | +928340|2013-11-01 |
| 203 | +928340|2013-11-08 |
| 204 | +928316|2013-11-08 |
| 205 | +928316|2013-11-15 |
| 206 | +``` |
| 207 | + |
| 208 | +Like `snapshot`, `chord` by default will output in *EAVT* format, but can be output in row-oriented form using the `--pivot` option: |
| 209 | + |
| 210 | +``` |
| 211 | +> ivory chord --feature-store feature_store.txt --dictionary feature_dictionary.txt --instances instances.txt --output nov2013snapshot |
| 212 | + |
| 213 | +> ivory chord --pivot --feature-store feature_store.txt --dictionary feature_dictionary.txt --instances instances.txt --output nov2013snapshot |
| 214 | +``` |
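
A chord can be thought of as one small snapshot per instance descriptor. The Python sketch below illustrates that reading under stated assumptions: `parse_instances` and `chord` are hypothetical helpers, facts are held in memory, and a fact stamped exactly at the instance time is assumed to be included.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

Fact = Tuple[str, str, datetime, str]  # entity, attribute, time, value
Instance = Tuple[str, datetime]        # entity, instance time


def parse_instances(lines: Iterable[str]) -> List[Instance]:
    """Read entity|yyyy-mm-dd instance descriptors."""
    instances = []
    for line in lines:
        entity, date = line.strip().split("|")
        instances.append((entity, datetime.strptime(date, "%Y-%m-%d")))
    return instances


def chord(facts: Iterable[Fact],
          instances: Iterable[Instance]) -> Dict[Instance, Dict[str, str]]:
    """Latest value per attribute for each (entity, time) instance."""
    facts = list(facts)
    result: Dict[Instance, Dict[str, str]] = {}
    for entity, as_of in instances:
        latest: Dict[str, Tuple[datetime, str]] = {}
        for fact_entity, attribute, time, value in facts:
            if fact_entity != entity or time > as_of:
                continue
            if attribute not in latest or time > latest[attribute][0]:
                latest[attribute] = (time, value)
        result[(entity, as_of)] = {a: v for a, (_t, v) in latest.items()}
    return result


# Illustrative facts only.
facts = [
    ("928340", "inbound.count.1W", datetime(2013, 10, 27), "35"),
    ("928340", "inbound.count.1W", datetime(2013, 11, 3), "40"),
]
instances = parse_instances(["928340|2013-11-01", "928340|2013-11-08"])
values = chord(facts, instances)
```

The same entity yields `35` for the 1 November instance and `40` for the 8 November instance, which is what distinguishes a chord from a single snapshot.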
| 215 | + |
| 216 | + |
| 217 | +Data Generation |
| 218 | +--------------- |
| 219 | + |
| 220 | +Ivory supports generating random dictionaries and fact sets which can be used for testing. |
| 221 | + |
| 222 | +To generate a random dictionary, you need to specify the number of features, and an output location: |
| 223 | + |
| 224 | +``` |
| 225 | +> ivory generate-dictionary --features 10000 --output random_dictionary |
| 226 | +``` |
| 227 | + |
| 228 | +This outputs two files: |
| 229 | + |
| 230 | +1. the dictionary itself. |
| 231 | +2. a feature flag file specifying the sparsity and frequency of each feature, where sparsity is a double between 0.0 and 1.0, and frequency is one of `daily`, `weekly`, or `monthly`. |
| 232 | + |
| 233 | +The format of the feature flag file is: |
| 234 | + |
| 235 | +``` |
| 236 | +namespace|name|sparsity|frequency |
| 237 | +``` |
| 238 | + |
| 239 | +An example is: |
| 240 | +``` |
| 241 | +widgets|inbound.count.1W|0.4|weekly |
| 242 | +demo|postcode|0.7|monthly |
| 243 | +``` |
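
For reference, a feature flag line can be read and range-checked with a few lines of Python. This is a hedged sketch for test tooling, not code from Ivory: `FeatureFlag` and `parse_flag` are hypothetical names, and rejecting out-of-range values with `ValueError` is a design choice made here.

```python
from typing import NamedTuple

FREQUENCIES = {"daily", "weekly", "monthly"}


class FeatureFlag(NamedTuple):
    """Hypothetical in-memory form of one feature flag line."""
    namespace: str
    name: str
    sparsity: float
    frequency: str


def parse_flag(line: str) -> FeatureFlag:
    """Parse one pipe-delimited flag line and validate its fields."""
    namespace, name, sparsity, frequency = line.strip().split("|")
    value = float(sparsity)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"sparsity out of range: {sparsity}")
    if frequency not in FREQUENCIES:
        raise ValueError(f"unknown frequency: {frequency}")
    return FeatureFlag(namespace, name, value, frequency)


flag = parse_flag("widgets|inbound.count.1W|0.4|weekly")
```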
| 244 | + |
| 245 | +To generate random facts, you need to specify a dictionary, feature flag file, number of entities, time range, number of fact sets, and output location: |
| 246 | + |
| 247 | +``` |
| 248 | +> ivory generate-facts --dictionary feature_dictionary.txt --flags feature_flags.txt --entities 10000000 --time-range 2012-01-01_to_2012-12-31 --factsets 3 --output random_factsets |
| 249 | +``` |
| 250 | + |
| 251 | +This outputs a fact set partitioned by *namespace* and *date* so it can be used as part of a feature store. |
| 252 | + |
| 253 | + |
| 254 | + |
| 255 | +Versioning |
| 256 | +---------- |
| 257 | + |
| 258 | +The format of fact sets is versioned. This allows the format to be modified in the future while still maintaining feature stores that |
| 259 | +reference fact sets persisted in an older format. |
| 260 | + |
| 261 | +A fact set format version is specified by a `.version` file that is stored at the root directory of a given fact set. |