stop hard-coding sc:Integer, allow for sc:Float #4

Merged: 1 commit, Jun 3, 2024
1 change: 0 additions & 1 deletion README.md
@@ -152,7 +152,6 @@ Same as above but use a JVM option in domain.xml such as the example below.
 ### Differences from Kaggle

 - I see an `encodingFormat` of `text/comma-separated-values`. Kind of curious about that since I think `text/csv` is more the MIME type that's on https://www.iana.org/assignments/media-types/media-types.xhtml and https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types . See https://github.com/IQSS/dataverse/issues/4943#issuecomment-2145333830
-- Another thing that sticks out is that I see all of the `field`s have a `dataType` of `sc:Integer`. But nearly all of the columns (excluding `quality` and `Id`) are `sc:Float`. On the Kaggle side, we have a column type of "Id" and so if that's set on a column, we set the `dataType` to `sc:Text` since Ids can often be non-numerical. Just a minor difference there, though, so nothing alarming to me personally.

 ### Differences from pyDataverse

21 changes: 19 additions & 2 deletions src/main/java/io/gdcc/spi/export/croissant/CroissantExporter.java
@@ -282,14 +282,19 @@ public void exportDataset(ExportDataProvider dataProvider, OutputStream outputSt
         String variableDescription = dataVariableObject.getString("label", "");
         String variableFormatType =
                 dataVariableObject.getString("variableFormatType");
+        String variableIntervalType =
+                dataVariableObject.getString("variableIntervalType");
         String dataType = null;
+        /**
+         * There are only two variableFormatType types on the Dataverse side:
+         * CHARACTER and NUMERIC. (See VariableType in DataVariable.java.)
+         */
         switch (variableFormatType) {
             case "CHARACTER":
                 dataType = "sc:Text";
                 break;
             case "NUMERIC":
-                // TODO: Integer? What about other numeric types?
-                dataType = "sc:Integer";
+                dataType = getNumericType(variableIntervalType);
                 break;
             default:
                 break;
@@ -400,4 +405,16 @@ private String getBibtex(
         sb.append("}");
         return sb.toString();
     }
+
+    private String getNumericType(String variableIntervalType) {
+        /**
+         * According to DataVariable.java in Dataverse, the four possibilities are: discrete, contin
+         * (continuous), nominal, and dichotomous.
+         */
+        return switch (variableIntervalType) {
+            case "discrete" -> "sc:Integer";
+            case "contin" -> "sc:Float";
+            default -> "sc:Text";
+        };
+    }
 }
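
A note on the new `getNumericType` helper: the mapping can be tried out in isolation. The sketch below is a standalone copy, not the exporter's code. The class name `NumericTypeSketch`, the method name `numericType`, and the null check are assumptions added for illustration. A Java `switch` on a `null` String throws `NullPointerException`, and the diff does not show whether `variableIntervalType` can be absent, so the sketch falls back to `sc:Text` in that case.

```java
import java.util.List;

public class NumericTypeSketch {

    // Hypothetical, null-tolerant variant of the getNumericType mapping in the diff.
    static String numericType(String variableIntervalType) {
        if (variableIntervalType == null) {
            // Assumption: treat a missing interval type like the default branch.
            return "sc:Text";
        }
        return switch (variableIntervalType) {
            case "discrete" -> "sc:Integer";
            case "contin" -> "sc:Float";
            // "nominal", "dichotomous", and anything unexpected end up here.
            default -> "sc:Text";
        };
    }

    public static void main(String[] args) {
        // The four interval types named in the diff's comment.
        for (String interval : List.of("discrete", "contin", "nominal", "dichotomous")) {
            System.out.println(interval + " -> " + numericType(interval));
        }
    }
}
```

Running `main` prints one line per interval type. `nominal` and `dichotomous` both land on the `default` branch, as in the diff.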
@@ -301,6 +301,13 @@ public void testExportDatasetMax() throws Exception {
         assertEquals(prettyPrint(expected), prettyPrint(outputStreamMax.toString()));
     }

+    /*
+    The data in stata13-auto.dta looks something like this:
+    make           price  mpg  rep78  headroom  trunk  weight  length  turn  displacement  gear_ratio  foreign
+    "AMC Concord"  4099   22   3      2.5       11     2930    186     40    121           3.58        0
+    "AMC Pacer"    4749   17   3      3.0       11     3350    173     40    258           2.53        0
+    "AMC Spirit"   3799   22          3.0       12     2640    168     35    121           3.08        0
+    */
     @Test
     public void testExportDatasetCars() throws Exception {
         exporter.exportDataset(dataProviderCars, outputStreamCars);
4 changes: 2 additions & 2 deletions src/test/resources/cars/expected/cars-croissant.json
@@ -167,7 +167,7 @@
   "@type": "cr:Field",
   "name": "headroom",
   "description": "Headroom (in.)",
-  "dataType": "sc:Integer",
+  "dataType": "sc:Float",
   "source": {
     "@id": "7",
     "fileObject": {
@@ -239,7 +239,7 @@
   "@type": "cr:Field",
   "name": "gear_ratio",
   "description": "Gear Ratio",
-  "dataType": "sc:Integer",
+  "dataType": "sc:Float",
   "source": {
     "@id": "8",
     "fileObject": {
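
As a rough sanity check on the two expected-output changes in `cars-croissant.json`, the sample rows quoted in the test comment can be scanned for values written with a decimal point. This is only an illustration: the exporter itself relies on `variableIntervalType` from Dataverse, not on inspecting values, and the class and method names here are hypothetical. Only the two complete sample rows are used, with the `make` column left out.

```java
import java.util.ArrayList;
import java.util.List;

public class SampleColumnCheck {

    // Numeric column names from the test comment (the text column "make" is omitted).
    static final String[] COLUMNS = {
        "price", "mpg", "rep78", "headroom", "trunk", "weight",
        "length", "turn", "displacement", "gear_ratio", "foreign"
    };

    // The "AMC Concord" and "AMC Pacer" rows from the test comment.
    static final String[][] ROWS = {
        {"4099", "22", "3", "2.5", "11", "2930", "186", "40", "121", "3.58", "0"},
        {"4749", "17", "3", "3.0", "11", "3350", "173", "40", "258", "2.53", "0"}
    };

    // Returns the columns in which at least one sample value contains a decimal point.
    static List<String> columnsWithDecimals() {
        List<String> result = new ArrayList<>();
        for (int col = 0; col < COLUMNS.length; col++) {
            for (String[] row : ROWS) {
                if (row[col].contains(".")) {
                    result.add(COLUMNS[col]);
                    break; // one decimal value is enough to flag the column
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(columnsWithDecimals()); // prints [headroom, gear_ratio]
    }
}
```

The two flagged columns are the same two fields whose `dataType` changes from `sc:Integer` to `sc:Float` in the expected JSON.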