Commit 90bd476

Initial oss import.
0 parents  commit 90bd476

111 files changed

+9344
-0
lines changed

.gitignore

+22
@@ -0,0 +1,22 @@
*.class
*.log

# sbt specific
dist/*
target/
lib_managed/
src_managed/
project/boot/
project/plugins/project/
project/source.scala

# Scala-IDE specific
.scala_dependencies
.idea
*.iml

# vim
*.swp

# Local source linking
project/source.scala

NOTICE.txt

+21
@@ -0,0 +1,21 @@
====
Copyright 2013,2014 National ICT Australia Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
====

(c) Copyright National ICT Australia Limited (NICTA), 2013,2014

Some files contain other unattributed Contributions to the Work; All Contributions
received from Contributors under the terms of the Apache License Agreement v 2.0 and
re-distributed in accordance with that license.

README.md

+261
@@ -0,0 +1,261 @@
ivory
=====

```
ivory: (the ivories) the keys of a piano
```

Overview
--------

*ivory* defines a specification for how to store feature data and provides a set of tools for
querying it. It does not provide any tooling for producing feature data in the first place.
All ivory commands run as MapReduce jobs, so it is assumed that feature data is maintained on HDFS.


Repository
----------

The tooling provided operates on an *ivory repository*. An ivory repository is a convention for
storing *fact sets*, feature *dictionaries* and *feature stores*. The directory structure is
as follows:

```
ivory_repository/
├── metadata
│   ├── dictionaries
│   │   ├── dictionary1
│   │   └── dictionary2
│   └── stores
│       ├── feature_store1
│       └── feature_store2
└── fact_sets
    ├── fact_set1
    └── fact_set2
```


Fact Sets
---------

A *fact set* is a single directory containing multiple *facts*, where a *fact* defines:

1. The *entity* the feature value is associated with;
2. An *attribute* specifying which feature;
3. The *value* itself;
4. The *time* from which the feature value is valid.

That is, a fact is simply an *EAVT* record. Multiple facts form a fact set, which is stored as
*EAVT* files partitioned by *namespace* and *date*. For example:

```
my_fact_set/
├── widgets
│   └── 2014
│       └── 01
│           ├── 09
│           │   ├── eavt-00000
│           │   ├── eavt-00001
│           │   └── eavt-00002
│           ├── 10
│           │   ├── eavt-00000
│           │   ├── eavt-00001
│           │   └── eavt-00002
│           └── 11
│               ├── eavt-00000
│               ├── eavt-00001
│               └── eavt-00002
└── demo
    └── 2013
        └── 01
            └── 09
                ├── eavt-00000
                └── eavt-00001
```

In this fact set, facts are partitioned across two namespaces: `widgets` and `demo`. The `widgets` facts
are spread across three dates, while the `demo` facts are confined to one. Note also that
a given namespace-date partition can contain multiple EAVT files.

EAVT files are simply pipe-delimited text files with one EAVT record per line. For example, a line in
the file `my_fact_set/widgets/2014/01/10/eavt-00001` might look like:

```
928340|inbound.count.1W|35|43200
```

That is, the fact: "feature attribute `inbound.count.1W` has value `35` for entity `928340` as of
12:00 on 10 January 2014". The time component of the record is the number of seconds into the day specified by
the partition the record belongs to. Note that Ivory does not enforce or specify a time zone for the time component
of a fact. Choose a time zone that reflects the domain; within a given Ivory feature store, however,
all facts should use the same time zone.
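
As a minimal sketch of how such a record could be interpreted (this is not Ivory's own parser; `Fact` and `parse_eavt_line` are hypothetical names, and the sketch assumes values never contain a `|`), the following Python combines the date taken from the partition path with the seconds-into-day field:

```python
from datetime import date, datetime, timedelta
from typing import NamedTuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: str       # kept as raw text; the encoding lives in the dictionary
    time: datetime   # naive on purpose: Ivory does not specify a time zone


def parse_eavt_line(line: str, partition_date: date) -> Fact:
    """Parse one pipe-delimited EAVT record read from a given date partition."""
    entity, attribute, value, seconds = line.rstrip("\n").split("|")
    midnight = datetime.combine(partition_date, datetime.min.time())
    return Fact(entity, attribute, value, midnight + timedelta(seconds=int(seconds)))


# The example record, read from the widgets/2014/01/10 partition.
fact = parse_eavt_line("928340|inbound.count.1W|35|43200", date(2014, 1, 10))
print(fact.time.isoformat())  # 2014-01-10T12:00:00
```

The partition date has to be supplied by the caller because the record itself only carries the offset within the day.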


Feature store
-------------

A feature store comprises one or more *fact sets* and is represented by a text file containing
an ordered list of references to those fact sets. For example:

```
00005
00004
00003
widget_fixed
00002
00001
00000
```

The ordering is important as it allows facts to be overridden. When a feature store is queried, if multiple facts
with the same entity, attribute and time are identified, the value from the fact contained in the most recent fact
set will be used, where most recent means listed higher in the feature store file.
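
The override rule can be sketched in a few lines of Python. This is an illustration only, not Ivory's implementation: `resolve` is a hypothetical helper, fact sets are modelled as in-memory dictionaries, and the value `37` is made-up sample data.

```python
from typing import Dict, List, Optional, Tuple

# One fact set, modelled as (entity, attribute, time) -> value.
FactSet = Dict[Tuple[str, str, str], str]


def resolve(store_order: List[str],
            fact_sets: Dict[str, FactSet],
            key: Tuple[str, str, str]) -> Optional[str]:
    """Return the value from the first fact set, in store order, that holds the key."""
    for name in store_order:  # listed higher in the store file == more recent
        value = fact_sets.get(name, {}).get(key)
        if value is not None:
            return value
    return None


key = ("928340", "inbound.count.1W", "2014-01-10T12:00")
store_order = ["00001", "widget_fixed", "00000"]
fact_sets = {
    "00000": {key: "35"},          # original value
    "widget_fixed": {key: "37"},   # hypothetical correction listed above it
}
print(resolve(store_order, fact_sets, key))  # 37
```

Because `widget_fixed` appears above `00000` in the store order, its value wins without the original fact set being modified.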

Because a feature store is specified simply by referencing fact sets, Ivory supports a poor man's form of
versioning, which enables use cases such as:

* overriding buggy values with corrected ones;
* combining *production* features with *ad-hoc* features.


Dictionary
----------

All features are identified by their name and namespace. In the example fact above, the feature is `widgets:inbound.count.1W`
where `widgets` is the namespace and `inbound.count.1W` is the name. Ivory also requires the following metadata to be
associated with every namespace-name feature identifier:

* An *encoding* specifying the type encoding of a feature value:
    * `boolean`
    * `int`
    * `double`
    * `string`

* A *classification* type specifying how a feature value can be semantically interpreted and used:
    * `numerical`
    * `categorical`

* A human-readable *description*.

In Ivory, feature metadata is kept separate from the feature store (the facts), in its own set of text files known
as *feature dictionaries*. Dictionary text files are also pipe-delimited and of the following form:

```
namespace|name|encoding|type|description
```

So for the fact above, we could specify a dictionary entry such as:

```
widgets|inbound.count.1W|int|numerical|Count in the last week
```

Other dictionary entries might look like the following:

```
demo|gender|string|categorical|Gender
demo|postcode|string|categorical|Postcode
```

Given a dictionary, we can use Ivory to validate facts in a feature store. The `validate` command will
check that the encoding types specified for features in the dictionary are consistent with facts:

```
> ivory validate --feature-store feature_store.txt --dictionary feature_dictionary.txt
```
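
To illustrate the kind of consistency check involved, here is a minimal Python sketch. It is not Ivory's validation logic: `parse_entry` and `conforms` are hypothetical helpers, and the accepted spelling of boolean values (`true`/`false`) is an assumption, since the text above does not define it.

```python
from typing import NamedTuple


class DictionaryEntry(NamedTuple):
    namespace: str
    name: str
    encoding: str      # boolean | int | double | string
    type: str          # numerical | categorical
    description: str


def parse_entry(line: str) -> DictionaryEntry:
    """Parse one 'namespace|name|encoding|type|description' line."""
    # maxsplit=4 keeps any further '|' characters inside the description.
    return DictionaryEntry(*line.rstrip("\n").split("|", 4))


def conforms(value: str, encoding: str) -> bool:
    """Rough check that a raw text value fits the declared encoding."""
    if encoding == "string":
        return True
    if encoding == "boolean":
        return value in ("true", "false")  # assumed spelling
    try:
        int(value) if encoding == "int" else float(value)
    except ValueError:
        return False
    return encoding in ("int", "double")  # unknown encodings never conform


entry = parse_entry("widgets|inbound.count.1W|int|numerical|Count in the last week")
print(conforms("35", entry.encoding))   # True
print(conforms("abc", entry.encoding))  # False
```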

We can also use Ivory to generate statistics for the values of specific features across a feature store using the
`inspect` command. This will compute statistics such as density, ranges (for numerical features), factors (for
categorical features), histograms, means, etc. Inspections can filter both the features of interest and, by time,
which facts are considered:

```
> ivory inspect --feature-store feature_store.txt --dictionary feature_dictionary.txt --features features.txt --start-time '2013-01-01' --end-time '2014-01-01'
```


Querying
--------

Ivory supports two types of queries: *snapshots* and *chords*.


A `snapshot` query is used to extract the feature values for entities at a certain point in time. Snapshotting can filter
the set of features and/or entities considered. By default the output is in *EAVT* format, but it can be produced in
row-oriented form (i.e. one column per feature) using the `--pivot` option. When a `snapshot` query is performed, the most
recent feature value for a given entity-attribute, with respect to the snapshot time, will be returned in the output:

```
# get a snapshot of values for specific features and entities as of 1 Nov 2013
> ivory snapshot --feature-store feature_store.txt --dictionary feature_dictionary.txt --features features.txt --entities entities.txt --time '2013-11-01' --output nov2013snapshot

# Pivot the table to be row oriented
> ivory snapshot --pivot --feature-store feature_store.txt --dictionary feature_dictionary.txt --features features.txt --entities entities.txt --time '2013-11-01' --output nov2013snapshot
```
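
The "most recent value with respect to the snapshot time" rule can be sketched as follows. This is a simplified in-memory illustration in Python, not how Ivory's MapReduce job works: `snapshot` here is a hypothetical function, times are plain `YYYY-MM-DD` strings (which sort chronologically), the sample values are invented, and treating a fact dated exactly at the snapshot time as included is an assumption.

```python
from typing import Dict, Iterable, Tuple

# (entity, attribute, value, date); dates are 'YYYY-MM-DD' strings.
Fact = Tuple[str, str, str, str]


def snapshot(facts: Iterable[Fact], as_of: str) -> Dict[Tuple[str, str], Fact]:
    """Keep, per (entity, attribute), the latest fact dated on or before as_of."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for fact in facts:
        entity, attribute, _, time = fact
        if time > as_of:
            continue  # fact is from after the snapshot time
        key = (entity, attribute)
        if key not in latest or time > latest[key][3]:
            latest[key] = fact
    return latest


facts = [
    ("928340", "inbound.count.1W", "30", "2013-10-20"),
    ("928340", "inbound.count.1W", "35", "2013-10-27"),
    ("928340", "inbound.count.1W", "41", "2013-11-03"),
]
result = snapshot(facts, "2013-11-01")
print(result[("928340", "inbound.count.1W")][2])  # 35
```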

A `chord` query is used to extract the feature values for entities at different points in time, known as *instances*. That is,
for each entity, a different time is specified. In fact, multiple times can be specified per entity. To invoke a chord query,
a file of *instance descriptors*, which are entity-time pairs, must be specified. For example:

```
928340|2013-11-01
928340|2013-11-08
928316|2013-11-08
928316|2013-11-15
```

Like `snapshot`, `chord` outputs *EAVT* format by default, but it can produce row-oriented output using the `--pivot` option:

```
> ivory chord --feature-store feature_store.txt --dictionary feature_dictionary.txt --instances instances.txt --output nov2013snapshot

> ivory chord --pivot --feature-store feature_store.txt --dictionary feature_dictionary.txt --instances instances.txt --output nov2013snapshot
```
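
A chord can be thought of as one snapshot per instance descriptor. The Python sketch below illustrates that idea under the same simplifying assumptions as before: it is not Ivory's implementation, `parse_instances` and `chord` are hypothetical names, the fact values are invented, and a fact dated exactly at the instance time is assumed to be included.

```python
from typing import Dict, Iterable, List, Tuple

Fact = Tuple[str, str, str, str]   # (entity, attribute, value, date)
Instance = Tuple[str, str]         # (entity, date)


def parse_instances(lines: Iterable[str]) -> List[Instance]:
    """Parse 'entity|date' instance descriptors."""
    instances = []
    for line in lines:
        entity, as_of = line.strip().split("|")
        instances.append((entity, as_of))
    return instances


def chord(facts: Iterable[Fact],
          instances: Iterable[Instance]) -> Dict[Tuple[str, str, str], str]:
    """Map (entity, instance date, attribute) to the latest value on or before that date."""
    all_facts = list(facts)
    best_time: Dict[Tuple[str, str, str], str] = {}
    result: Dict[Tuple[str, str, str], str] = {}
    for entity, as_of in instances:
        for f_entity, attribute, value, time in all_facts:
            if f_entity != entity or time > as_of:
                continue
            key = (entity, as_of, attribute)
            if key not in best_time or time > best_time[key]:
                best_time[key] = time
                result[key] = value
    return result


instances = parse_instances(["928340|2013-11-01", "928340|2013-11-08"])
facts = [
    ("928340", "inbound.count.1W", "35", "2013-10-27"),
    ("928340", "inbound.count.1W", "41", "2013-11-03"),
]
values = chord(facts, instances)
print(values[("928340", "2013-11-01", "inbound.count.1W")])  # 35
print(values[("928340", "2013-11-08", "inbound.count.1W")])  # 41
```

The same entity yields different values for its two instances because the second instance falls after the later fact.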


Data Generation
---------------

Ivory supports generating random dictionaries and fact sets which can be used for testing.

To generate a random dictionary, you need to specify the number of features and an output location:

```
> ivory generate-dictionary --features 10000 --output random_dictionary
```

This outputs two files:

1. the dictionary itself.
2. a feature flag file specifying the sparsity and frequency of each feature, where sparsity is a double between 0.0 and 1.0, and frequency is one of `daily`, `weekly`, or `monthly`.

The format of the feature flag file is:

```
namespace|name|sparsity|frequency
```

An example is:

```
widgets|inbound.count.1W|0.4|weekly
demo|postcode|0.7|monthly
```
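
A feature flag line can be parsed and range-checked as in the Python sketch below. `FeatureFlag` and `parse_flag` are hypothetical names used only for illustration; the constraints they enforce are the ones stated above.

```python
from typing import NamedTuple

FREQUENCIES = {"daily", "weekly", "monthly"}


class FeatureFlag(NamedTuple):
    namespace: str
    name: str
    sparsity: float
    frequency: str


def parse_flag(line: str) -> FeatureFlag:
    """Parse one feature flag line and check its sparsity and frequency."""
    namespace, name, sparsity_text, frequency = line.strip().split("|")
    sparsity = float(sparsity_text)
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError(f"sparsity out of range: {sparsity}")
    if frequency not in FREQUENCIES:
        raise ValueError(f"unknown frequency: {frequency}")
    return FeatureFlag(namespace, name, sparsity, frequency)


flag = parse_flag("widgets|inbound.count.1W|0.4|weekly")
print(flag.sparsity, flag.frequency)  # 0.4 weekly
```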

To generate random facts, you need to specify a dictionary, feature flag file, number of entities, time range, number of fact sets, and output location:

```
> ivory generate-facts --dictionary feature_dictionary.txt --flags feature_flags.txt --entities 10000000 --time-range 2012-01-01_to_2012-12-31 --factsets 3 --output random_factsets
```

This outputs a fact set partitioned by *namespace* and *date* so it can be used as part of a feature store.



Versioning
----------

The format of fact sets is versioned. This allows the format to be modified in the future while still maintaining
feature stores that reference fact sets persisted in an older format.

A fact set format version is specified by a `.version` file that is stored in the root directory of a given fact set.

bin/ci

+3
@@ -0,0 +1,3 @@
#!/bin/sh

./sbt -Dsbt.log.noformat=true "; clean; update; test-only -- console junitxml html; publish; project cli; set credentials := Seq(Credentials(realm=\"Amazon S3\", host=\"ambiata-dist.s3.amazonaws.com\", userName=\"$AWS_ACCESS_KEY\", passwd=\"$AWS_SECRET_KEY\")); s3Upload; project ivory; echo-version"

bin/ivory

+56
@@ -0,0 +1,56 @@
#!/bin/bash -e

# Left unquoted at the call site so the shell expands the glob to the assembly jar.
IVORY_JAR="ivory-assembly-*.jar"

function show_usage {
    echo "Usage: $0 {create-repo|create-store|import-dictionary|import-store|import-text-eavt|validate-store|validate-factset|snapshot|gen-dictionary|gen-facts}"
}

# No command given: print usage and stop instead of falling through.
if [ "$1" == "" ]; then
    show_usage
    exit 1
fi

CMD=$1
shift

# Map the sub-command to its CLI entry point.
case "$CMD" in
    create-repo)
        CLI="com.ambiata.ivory.repository.CreateRepositoryCli"
        ;;
    create-store)
        CLI="com.ambiata.ivory.repository.CreateFeatureStoreCli"
        ;;
    import-dictionary)
        CLI="com.ambiata.ivory.ingest.DictionaryImporterCli"
        ;;
    import-store)
        CLI="com.ambiata.ivory.ingest.FeatureStoreImporterCli"
        ;;
    import-text-eavt)
        CLI="com.ambiata.ivory.ingest.TextEavtImporterCli"
        ;;
    validate-store)
        CLI="com.ambiata.ivory.validate.ValidateStoreCli"
        ;;
    validate-factset)
        CLI="com.ambiata.ivory.validate.ValidateStoreFactSetCli"
        ;;
    snapshot)
        CLI="com.ambiata.ivory.snapshot.SnapshotCli"
        ;;
    gen-dictionary)
        CLI="com.ambiata.ivory.generation.GenerateDictionaryCli"
        ;;
    gen-facts)
        CLI="com.ambiata.ivory.generation.GenerateFactsCli"
        ;;
    *)
        show_usage
        exit 1
        ;;
esac

# Pass the remaining arguments through unchanged, preserving any whitespace in them.
hadoop jar $IVORY_JAR "$CLI" "$@"

echo "Done"
