
Commit 5a804e8

Merge pull request #72 from ambiata/cofarrell/minor-cleanup
Minor cleanup for various unrelated things
2 parents: 0e13188 + e4f863d

19 files changed, +70 -88 lines

README.md (+10 -10)

@@ -29,7 +29,7 @@ ivory_repository/
 │   └── stores
 │       ├── feature_store1
 │       └── feature_store2
-└── fact_sets
+└── factsets
     ├── fact_set1
     └── fact_set2
 ```
@@ -74,8 +74,8 @@ my_fact_set/
 
 ```
 
-In this fact set, facts are partioned across two namespaces: `widgets` and `demo`. The *widget* facts
-are spread accross three dates, while *demographic* facts are constrained to one. Note also that
+In this fact set, facts are partitioned across two namespaces: `widgets` and `demo`. The *widget* facts
+are spread across three dates, while *demographic* facts are constrained to one. Note also that
 a given namespace-partition can contain multiple EAVT files.
 
 EAVT files are simply pipe-delimited text files with one EAVT record per line. For example, a line in
@@ -112,10 +112,10 @@ The ordering is important as it allows facts to be overriden. When a feature sto
 with the same entity, attribute and time are identified, the value from the fact contained in the most recent fact
 set will be used, where most recent means listed higher in the feature store file.
 
-Because a feature store can be speified by just referencing fact sets, Ivory can support poor-man versioning giving
+Because a feature store can be specified by just referencing fact sets, Ivory can support poor-man versioning giving
 rise to use cases such as:
 
-* overrding buggy values with corrected ones;
+* overriding buggy values with corrected ones;
 * combining *production* features with *ad-hoc* features.
 
 
@@ -138,7 +138,7 @@ feature identifier the following metadata:
 
 * A human-readable *description*.
 
-In Ivory, feature metadata is seperated from the features store (facts) in its own set of text files known
+In Ivory, feature metadata is separated from the features store (facts) in its own set of text files known
 as *feature dictionaries*. Dictionary text files are also pipe-delimited and of the following form:
 
 ```
@@ -165,7 +165,7 @@ check that the encoding types specified for features in the dictionary are consi
 > ivory validate --feature-store feature_store.txt --dictionary feature_dictionary.txt
 ```
 
-We can also use Ivory to generate statistics for the values of specific features accross a feature store using the
+We can also use Ivory to generate statistics for the values of specific features across a feature store using the
 `inspect` command. This will compute statistics such as density, ranges (for numerical features), factors (for
 categorical features), historgrams, means, etc. Inspections can filter both the features of interest as well which
 facts to considered by time:
@@ -181,7 +181,7 @@ Querying
 Ivory supports two types of queries: *snapshots* and *chords*.
 
 
-A `snaphot` query is used to extract the feature values for entities at a certain point in time. Snapshoting can filter
+A `snapshot` query is used to extract the feature values for entities at a certain point in time. Snapshotting can filter
 the set of features and/or entities considered. By default the output is in *EAVT* format, but can be output in
 row-oriented form (i.e. column per feature) using the `--pivot` option. When a `snapshot` query is performed, the most
 recent feature value for a given entity-attribute, with respect to the snapshot time, will be returned in the output:
@@ -233,7 +233,7 @@ This outputs two files:
 The format of the feature flag file is:
 
 ```
-namespace|name|sparcity|fequency
+namespace|name|sparcity|frequency
 ```
 
 An example is:
@@ -258,4 +258,4 @@ Versioning
 The format of fact sets are versioned. This allows the format of fact sets to be modified in the future but still maintain feature stores that
 reference fact sets persisted in an older format.
 
-A fact set format version is specifed by a `.version` file that is stored at the root directory of a given fact set.
+A fact set format version is specified by a `.version` file that is stored at the root directory of a given fact set.
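As context for the `fact_sets` to `factsets` rename above: each factset holds the pipe-delimited EAVT records the README describes, one entity|attribute|value|time record per line. A hypothetical record, with an invented attribute name purely for illustration:

```
E1|widgets.sold|15|2012-01-15T08:02:33
```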

doc/dates.md (+4 -4)

@@ -74,7 +74,7 @@ Ivory supports a sub-set of ISO 8601 timestamps.
 
 `yyyy-MM-dd` -
 
-Date with a day granularitry in the local time zone. Example: `2012-01-15`,
+Date with a day granularity in the local time zone. Example: `2012-01-15`,
 `2014-12-31`.
 
 #### Local Date And Time
@@ -267,7 +267,7 @@ E4|A1|3|2010-03-03T14:30:00+11:00
 
 ##### `Ingestion Solution 2`
 Perform individual ingestions for each timezone, using the
-"Local date / time" format, but specificy an overriding
+"Local date / time" format, but specify an overriding
 ingestion timezone for the whole dataset. The ingestion
 will then translate each row into the repository timezone.
 
@@ -308,7 +308,7 @@ To address this we could do one of two things:
 - annotate DST overlapped hours with an extra bit in the time field; or
 - offset time by an additional interval to handle the gained time.
 
-However, both of these things require non-standard treament of "second
+However, both of these things require non-standard treatment of "second
 of day" and will require code changes to ivory to handle.
 
 To be clear, at this point ivory handles "second of day" based only on
@@ -326,7 +326,7 @@ There are number of key pieces of this which are not complete:
 there is no "standard library" for dealing with dates in a
 consistent way within ivory.
 
-- The ISO 8601 variants are not complete and not uniformally
+- The ISO 8601 variants are not complete and not uniformly
 supported.
 
 - Ingestion incorrectly forces a timezone to be specified for

doc/quality.md (+1 -1)

@@ -6,4 +6,4 @@ Current Quality Hitlist
 * Remove duplication, there is too much conceptual duplication in storage
 * Remove "hole" in the middle anti-pattern, composition first.
 * Configuration goes in as arguments. Remove mix of "configuration" styles with implicits and readers.
-* Consist effect handling, unsafePerformIO's go at the top.
+* Consistent effect handling, unsafePerformIO's go at the top.

doc/remotes.md (+4 -4)

@@ -34,7 +34,7 @@ repos for individuals to cook up their own features:
 
 * improve performance of snapshots as of "now"
 * steps:
-    1. generate a snaphsot using the traditional approach
+    1. generate a snapshot using the traditional approach
     2. store the snapshot as a fact set in another repo, the "snapshot" repo
     3. in the "snapshot repo" create a feature store that includes the snapshot fact set and all fact sets
        added since the first snapshot
@@ -61,7 +61,7 @@ Generalising the idea of versioning
 One of the core ideas of Ivory is that it is an immutable *database* of facts. Immutable views or *versions* of
 the database are constructed by combining a specific feature store and dictionary together. All queries, then,
 should be with respect to a particular *version*. Whilst the design of ivory allows for the notion of versions,
-it is currently not a first class citizen. Furthermore, it were to be made a first class citizen, the mechansim
+it is currently not a first class citizen. Furthermore, it were to be made a first class citizen, the mechanism
 for dealing with remote repos may fall out more naturally.
 
 There are a number of *objects* in our data model that should be versioned:
@@ -73,9 +73,9 @@ There are a number of *objects* in our data model that should be versioned:
 
 It may be worth borrowing ideas from Git on how this is designed. For example:
 
-* Version identifers are hashes of their content. For fact sets we could use CRCs associated with the data.
+* Version identifiers are hashes of their content. For fact sets we could use CRCs associated with the data.
 * Have human-readable references to identifiers, i.e. *branches* and *tags*.
 * The concept of branches is interesting in that it suggests a lineage between different versions. Given the
-  changes to dictionaries and feature store are typcially incremental in nature, the idea of a version being
+  changes to dictionaries and feature store are typically incremental in nature, the idea of a version being
   a delta applied to a *parent* version may be worth while.
 * This all, of course, plays in to the *remote* concept. That is, remote fact sets can be referenced by version.
ivory-cli/src/main/scala/com/ambiata/ivory/cli/ScoptReaders.scala (new file, +14)

@@ -0,0 +1,14 @@
+package com.ambiata.ivory.cli
+
+object ScoptReaders {
+
+  implicit val charRead: scopt.Read[Char] =
+    scopt.Read.reads(str => {
+      val chars = str.toCharArray
+      chars.length match {
+        case 0 => throw new IllegalArgumentException(s"'${str}' can not be empty!")
+        case 1 => chars(0)
+        case l => throw new IllegalArgumentException(s"'${str}' is not a char!")
+      }
+    })
+}
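For context, scopt 3.x resolves an implicit `scopt.Read[Char]` whenever a `Char`-typed option is declared, so the CLI commands below only need to import this shared instance instead of redefining it. A minimal sketch of the pattern (the command and option names here are hypothetical, not from the repo):

```scala
import com.ambiata.ivory.cli.ScoptReaders.charRead

case class Args(delim: Char = '|')

object Example {
  // scopt finds the imported `charRead` implicitly for the Char-typed option
  val parser = new scopt.OptionParser[Args]("example") {
    opt[Char]('d', "delimiter") action { (x, c) => c.copy(delim = x) } text "Single-character delimiter."
  }

  def main(args: Array[String]): Unit =
    parser.parse(args, Args()).foreach(a => println(s"delimiter: ${a.delim}"))
}
```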

ivory-cli/src/main/scala/com/ambiata/ivory/cli/chord.scala (+1 -9)

@@ -19,15 +19,7 @@ object chord extends IvoryApp {
 
   case class CliArguments(repo: String, output: String, tmp: String, entities: String, takeSnapshot: Boolean, pivot: Boolean, delim: Char, tombstone: String)
 
-  implicit val charRead: scopt.Read[Char] =
-    scopt.Read.reads(str => {
-      val chars = str.toCharArray
-      chars.length match {
-        case 0 => throw new IllegalArgumentException(s"'${str}' can not be empty!")
-        case 1 => chars(0)
-        case l => throw new IllegalArgumentException(s"'${str}' is not a char!")
-      }
-    })
+  import ScoptReaders.charRead
 
   val parser = new scopt.OptionParser[CliArguments]("extract-chord") {
     head("""

ivory-cli/src/main/scala/com/ambiata/ivory/cli/ingest.scala (+7 -10)

@@ -19,9 +19,7 @@ import scalaz.{DList => _, _}, Scalaz._
 
 object ingest extends IvoryApp {
 
-  val tombstone = List("")
-
-  case class CliArguments(repo: String, dictionary: Option[String], input: String, namespace: String, tmp: String, timezone: DateTimeZone, runOnSingleMachine: Boolean)
+  case class CliArguments(repo: String, dictionary: Option[String], input: String, namespace: String, timezone: DateTimeZone, runOnSingleMachine: Boolean)
 
   val parser = new scopt.OptionParser[CliArguments]("ingest") {
     head("""
@@ -33,7 +31,6 @@ object ingest extends IvoryApp {
 
     help("help") text "shows this usage text"
     opt[String]('r', "repo") action { (x, c) => c.copy(repo = x) } required() text "Path to an ivory repository."
-    opt[String]('t', "tmp") action { (x, c) => c.copy(tmp = x) } required() text "Path to store tmp data."
     opt[String]('i', "input") action { (x, c) => c.copy(input = x) } required() text "Path to data to import."
     opt[String]('d', "dictionary") action { (x, c) => c.copy(dictionary = Some(x)) } text "Name of dictionary to use."
     opt[String]('n', "namespace") action { (x, c) => c.copy(namespace = x) } required() text "Namespace'."
@@ -43,20 +40,20 @@ object ingest extends IvoryApp {
 
   }
 
-  def cmd = IvoryCmd[CliArguments](parser, CliArguments("", None, "", "", "", DateTimeZone.getDefault, false), HadoopCmd { configuration => c =>
-    val res = onHdfs(new Path(c.repo), c.dictionary, c.namespace, new Path(c.input), tombstone, new Path(c.tmp), c.timezone, c.runOnSingleMachine)
+  def cmd = IvoryCmd[CliArguments](parser, CliArguments("", None, "", "", DateTimeZone.getDefault, false), HadoopCmd { configuration => c =>
+    val res = onHdfs(new Path(c.repo), c.dictionary, c.namespace, new Path(c.input), c.timezone, c.runOnSingleMachine)
     res.run(configuration.modeIs(com.nicta.scoobi.core.Mode.Cluster)).map {
       case f => List(s"Successfully imported '${c.input}' as ${f} into '${c.repo}'")
     }
   })
 
-  def onHdfs(repo: Path, dictionary: Option[String], namespace: String, input: Path, tombstone: List[String], tmp: Path, timezone: DateTimeZone, runOnSingleMachine: Boolean): ScoobiAction[Factset] =
-    fatrepo.ImportWorkflow.onHdfs(repo, dictionary.map(defaultDictionaryImport(_)), importFeed(input, namespace, runOnSingleMachine), tombstone, tmp, timezone)
+  def onHdfs(repo: Path, dictionary: Option[String], namespace: String, input: Path, timezone: DateTimeZone, runOnSingleMachine: Boolean): ScoobiAction[Factset] =
+    fatrepo.ImportWorkflow.onHdfs(repo, dictionary.map(defaultDictionaryImport(_)), importFeed(input, namespace, runOnSingleMachine), timezone)
 
-  def defaultDictionaryImport(dictionary: String)(repo: HdfsRepository, name: String, tombstone: List[String], tmpPath: Path): Hdfs[Unit] =
+  def defaultDictionaryImport(dictionary: String)(repo: HdfsRepository, name: String): Hdfs[Unit] =
     DictionaryImporter.onHdfs(repo.root.toHdfs, repo.dictionaryByName(dictionary).toHdfs, name)
 
-  def importFeed(input: Path, namespace: String, runOnSingleMachine: Boolean)(repo: HdfsRepository, factset: Factset, dname: String, tmpPath: Path, errorPath: Path, timezone: DateTimeZone): ScoobiAction[Unit] = for {
+  def importFeed(input: Path, namespace: String, runOnSingleMachine: Boolean)(repo: HdfsRepository, factset: Factset, dname: String, errorPath: Path, timezone: DateTimeZone): ScoobiAction[Unit] = for {
     dict <- ScoobiAction.fromHdfs(IvoryStorage.dictionaryFromIvory(repo, dname))
     conf <- ScoobiAction.scoobiConfiguration
     _    <- if (!runOnSingleMachine)
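One thing this diff leaves implicit: the `tombstone` default and the `tmp` path are no longer threaded through `onHdfs`, `defaultDictionaryImport` and `importFeed`; the matching change to the `fatrepo.ImportWorkflow.onHdfs` signature sits in one of the other files in this commit, not excerpted here, which presumably now owns those defaults internally.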

ivory-cli/src/main/scala/com/ambiata/ivory/cli/ingestBulk.scala (+7 -10)

@@ -18,9 +18,7 @@ import scalaz.{DList => _, _}, Scalaz._
 
 object ingestBulk extends IvoryApp {
 
-  val tombstone = List("")
-
-  case class CliArguments(repo: String, dictionary: Option[String], input: String, tmp: String, timezone: DateTimeZone, optimal: Long, codec: Option[CompressionCodec])
+  case class CliArguments(repo: String, dictionary: Option[String], input: String, timezone: DateTimeZone, optimal: Long, codec: Option[CompressionCodec])
 
   val parser = new scopt.OptionParser[CliArguments]("ingest-bulk") {
     head("""
@@ -34,7 +32,6 @@ object ingestBulk extends IvoryApp {
     opt[Unit]('n', "no-compression") action { (_, c) => c.copy(codec = None) } text "Don't use compression."
 
     opt[String]('r', "repo") action { (x, c) => c.copy(repo = x) } required() text "Path to an ivory repository."
-    opt[String]('t', "tmp") action { (x, c) => c.copy(tmp = x) } required() text "Path to store tmp data."
     opt[String]('i', "input") action { (x, c) => c.copy(input = x) } required() text "Path to data to import."
     opt[Long]('o', "optimal-input-chunk") action { (x, c) => c.copy(optimal = x) } text "Optimal size (in bytes) of input chunk.."
     opt[String]('d', "dictionary") action { (x, c) => c.copy(dictionary = Some(x)) } text "Name of dictionary to use."
@@ -47,21 +44,21 @@ object ingestBulk extends IvoryApp {
 
   type Parts = String
 
   def cmd = IvoryCmd[CliArguments](parser,
-    CliArguments("", None, "", "", DateTimeZone.getDefault, 1024 * 1024 * 256 /* 256MB */, Some(new SnappyCodec)),
+    CliArguments("", None, "", DateTimeZone.getDefault, 1024 * 1024 * 256 /* 256MB */, Some(new SnappyCodec)),
     ScoobiCmd(configuration => c => {
-      val res = onHdfs(new Path(c.repo), c.dictionary, new Path(c.input), tombstone, new Path(c.tmp), c.timezone, c.optimal, c.codec)
+      val res = onHdfs(new Path(c.repo), c.dictionary, new Path(c.input), c.timezone, c.optimal, c.codec)
       res.run(configuration).map {
        case f => List(s"Successfully imported '${c.input}' as ${f} into '${c.repo}'")
      }
    }))
 
-  def onHdfs(repo: Path, dictionary: Option[String], input: Path, tombstone: List[String], tmp: Path, timezone: DateTimeZone, optimal: Long, codec: Option[CompressionCodec]): ScoobiAction[Factset] =
-    fatrepo.ImportWorkflow.onHdfs(repo, dictionary.map(defaultDictionaryImport(_)), importFeed(input, optimal, codec), tombstone, tmp, timezone)
+  def onHdfs(repo: Path, dictionary: Option[String], input: Path, timezone: DateTimeZone, optimal: Long, codec: Option[CompressionCodec]): ScoobiAction[Factset] =
+    fatrepo.ImportWorkflow.onHdfs(repo, dictionary.map(defaultDictionaryImport(_)), importFeed(input, optimal, codec), timezone)
 
-  def defaultDictionaryImport(dictionary: String)(repo: HdfsRepository, name: String, tombstone: List[String], tmpPath: Path): Hdfs[Unit] =
+  def defaultDictionaryImport(dictionary: String)(repo: HdfsRepository, name: String): Hdfs[Unit] =
     DictionaryImporter.onHdfs(repo.root.toHdfs, repo.dictionaryByName(dictionary).toHdfs, name)
 
-  def importFeed(input: Path, optimal: Long, codec: Option[CompressionCodec])(repo: HdfsRepository, factset: Factset, dname: String, tmpPath: Path, errorPath: Path, timezone: DateTimeZone): ScoobiAction[Unit] = for {
+  def importFeed(input: Path, optimal: Long, codec: Option[CompressionCodec])(repo: HdfsRepository, factset: Factset, dname: String, errorPath: Path, timezone: DateTimeZone): ScoobiAction[Unit] = for {
     dict <- ScoobiAction.fromHdfs(IvoryStorage.dictionaryFromIvory(repo, dname))
     list <- listing(input)
     conf <- ScoobiAction.scoobiConfiguration

ivory-cli/src/main/scala/com/ambiata/ivory/cli/main.scala (+2 -2)

@@ -73,9 +73,9 @@ case class IvoryCmd[A](parser: scopt.OptionParser[A], initial: A, runner: IvoryR
 
   private def parseAndRun(args: Seq[String], result: A => ResultTIO[List[String]]): IO[Option[Unit]] = {
     parser.parse(args, initial)
-      .map(result andThen {
+      .traverse(result andThen {
         _.run.map(_.fold(_.foreach(println), e => { println(s"Failed! - ${Result.asString(e)}"); sys.exit(1) }))
-      }).sequence
+      })
   }
 }
 
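The change above swaps `.map(...).sequence` for the equivalent single-pass `.traverse(...)`: traversing an `Option[A]` with a function `A => IO[B]` yields `IO[Option[B]]` directly. A self-contained scalaz sketch of why the two forms are interchangeable (the values and names here are hypothetical):

```scala
import scalaz._, Scalaz._
import scalaz.effect.IO

object TraverseSketch {
  val parsed: Option[Int]  = Some(42)
  val run: Int => IO[Unit] = n => IO(println(n))

  // before: build Option[IO[Unit]] with map, then flip it with sequence
  val before: IO[Option[Unit]] = parsed.map(run).sequence
  // after: traverse performs the map and the flip in one step
  val after: IO[Option[Unit]] = parsed.traverse(run)
}
```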

ivory-cli/src/main/scala/com/ambiata/ivory/cli/pivot.scala (+1 -9)

@@ -14,15 +14,7 @@ object pivot extends IvoryApp {
 
   case class CliArguments(input: String, output: String, dictionary: String, delim: Char, tombstone: String)
 
-  implicit val charRead: scopt.Read[Char] =
-    scopt.Read.reads(str => {
-      val chars = str.toCharArray
-      chars.length match {
-        case 0 => throw new IllegalArgumentException(s"'${str}' can not be empty!")
-        case 1 => chars(0)
-        case l => throw new IllegalArgumentException(s"'${str}' is not a char!")
-      }
-    })
+  import ScoptReaders.charRead
 
   val parser = new scopt.OptionParser[CliArguments]("extract-pivot") {
     head("""

ivory-cli/src/main/scala/com/ambiata/ivory/cli/pivotSnapshot.scala (+1 -9)

@@ -18,15 +18,7 @@ object pivotSnapshot extends IvoryApp {
 
   case class CliArguments(repo: String, output: String, delim: Char, tombstone: String, date: LocalDate)
 
-  implicit val charRead: scopt.Read[Char] =
-    scopt.Read.reads(str => {
-      val chars = str.toCharArray
-      chars.length match {
-        case 0 => throw new IllegalArgumentException(s"'${str}' can not be empty!")
-        case 1 => chars(0)
-        case l => throw new IllegalArgumentException(s"'${str}' is not a char!")
-      }
-    })
+  import ScoptReaders.charRead
 
   val parser = new scopt.OptionParser[CliArguments]("extract-pivot-snapshot") {
     head("""

ivory-example/bin/run (+3 -3)

@@ -1,13 +1,13 @@
 #!/bin/sh -eux
 
 TARGET=${1:-/tmp/test-$RANDOM}
-IVORY=$TARGET/ivory
+REPO=$TARGET/ivory
 DICT=$TARGET/dictionary
 FACTS=$TARGET/facts
 FLAGS=$TARGET/flags
 
 IVORY=$(dirname $0)/ivory
 $IVORY generate-dictionary -n 5 -f 100 -o $TARGET
 $IVORY generate-facts -d $DICT -f $FLAGS -n 1000 -s 2012-01-01 -e 2012-02-01 -o $FACTS
-$IVORY create-repository -p $IVORY
-$IVORY import-dictionary -r $IVORY -p $DICT -n "example"
+$IVORY create-repository -p $REPO
+$IVORY import-dictionary -r $REPO -p $DICT -n "example"
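This rename fixes a real bug rather than just style: the script assigned `IVORY=$TARGET/ivory` as the repository path and then immediately clobbered it with `IVORY=$(dirname $0)/ivory`, the path of the ivory binary, so `create-repository -p $IVORY` and `import-dictionary -r $IVORY` were pointed at the binary instead of the repository. Introducing `REPO` for the repository path removes the clash.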

ivory-ingest/src/main/scala/com/ambiata/ivory/ingest/mr.scala (+3 -3)

@@ -48,7 +48,7 @@ object IngestJob {
   job.setMapOutputKeyClass(classOf[LongWritable]);
   job.setMapOutputValueClass(classOf[BytesWritable]);
 
-  /* partiton & sort */
+  /* partition & sort */
   job.setPartitionerClass(classOf[IngestPartitioner])
   job.setGroupingComparatorClass(classOf[LongWritable.Comparator])
   job.setSortComparatorClass(classOf[LongWritable.Comparator])
@@ -122,8 +122,8 @@
 /**
  * Partitioner for ivory-ingest.
  *
- * Keys are partitioned by the extrnalized feature id (held in the top 32 bits of the key)
- * into pre-determined buckets. We use the predtermined buckets as upfront knowledge of
+ * Keys are partitioned by the externalized feature id (held in the top 32 bits of the key)
+ * into predetermined buckets. We use the predetermined buckets as upfront knowledge of
  * the input size is used to reduce skew on input data.
  */
 class IngestPartitioner extends Partitioner[LongWritable, BytesWritable] with Configurable {
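The corrected comment describes composite keys that carry the externalized feature id in their top 32 bits. As an illustration only, not the repository's actual `IngestPartitioner` (whose bucket logic uses predetermined buckets derived from input sizes), a partitioner recovering that id could look like:

```scala
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.Partitioner

// Sketch under the stated assumption: the externalized feature id
// lives in the top 32 bits of the composite long key.
class FeatureIdPartitioner extends Partitioner[LongWritable, BytesWritable] {
  def getPartition(key: LongWritable, value: BytesWritable, numPartitions: Int): Int = {
    val featureId = (key.get >>> 32).toInt       // recover the feature id
    (featureId & Int.MaxValue) % numPartitions   // stable, non-negative bucket
  }
}
```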

ivory-mr/src/main/scala/com/ambiata/ivory/mr/DistCache.scala (+1 -1)

@@ -14,7 +14,7 @@ import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.mapreduce.Job
 
 /**
- * This is module for managing passing data-types via tha distributed cache. This is
+ * This is module for managing passing data-types via the distributed cache. This is
  * _unsafe_ at best, and should be used with extreme caution. The only valid reason to
  * use it is when writing raw map reduce jobs.
  */
