[SPARK-44751][SQL] XML FileFormat Interface implementation
### What changes were proposed in this pull request?
This is the second PR related to the built-in XML data source implementation ([jira](https://issues.apache.org/jira/browse/SPARK-44751)).
The previous [PR](apache#41832) ported the spark-xml package.
This PR addresses the following:
- Implement FileFormat interface
- Address the review comments in the previous [XML PR](apache#41832)
- Moved from_xml and schema_of_xml to sql/functions
- Moved ".xml" to DataFrameReader/DataFrameWriter
- Removed old APIs like XmlRelation, XmlReader, etc.
- StaxXmlParser changes:
  - Use FailureSafeParser
  - Convert 'Row' usage to 'InternalRow'
  - Convert String to UTF8String
  - Handle MapData and ArrayData for MapType and ArrayType respectively
  - Use TimestampFormatter to parse timestamp
  - Use DateFormatter to parse date
- StaxXmlGenerator changes:
  - Convert 'Row' usage to 'InternalRow'
  - Handle UTF8String for StringType
  - Handle MapData and ArrayData for MapType and ArrayType respectively
  - Use TimestampFormatter to format timestamp
  - Use DateFormatter to format date
- Update XML tests accordingly because of the above changes
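
The "Use FailureSafeParser" item refers to Spark's per-record handling of malformed input, which is controlled by the parse modes `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`. The following is a minimal sketch of that idea in plain Python, not the code from this PR: the function name `failure_safe_parse`, the dict-based row, and the one-level element flattening are all assumptions made for illustration, and the real parser also fills the remaining columns with nulls for a corrupt record.

```python
import xml.etree.ElementTree as ET

VALID_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def failure_safe_parse(records, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Parse XML record strings into dicts, handling malformed input per `mode`.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode: {mode}")
    rows = []
    for raw in records:
        try:
            element = ET.fromstring(raw)
            # Flatten one level of child elements into column name -> text.
            row = {child.tag: child.text for child in element}
            row[corrupt_column] = None
            rows.append(row)
        except ET.ParseError as exc:
            if mode == "FAILFAST":
                # Abort the whole read on the first malformed record.
                raise ValueError(f"Malformed record: {raw!r}") from exc
            if mode == "DROPMALFORMED":
                # Silently skip the malformed record.
                continue
            # PERMISSIVE: keep the row and preserve the raw text for inspection.
            rows.append({corrupt_column: raw})
    return rows
```

Calling `failure_safe_parse(["<r><a>1</a></r>", "<r><a>1</r>"])` keeps both records in the default mode, with the second one carrying its raw text in `_corrupt_record`.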
### Why are the changes needed?
These changes are required to bring the XML data source on par with CSV and JSON and to support features such as streaming, which require the FileFormat interface to be implemented.
### Does this PR introduce _any_ user-facing change?
Yes, it adds support for XML data source.
### How was this patch tested?
- Ran all the XML unit tests.
- GitHub Actions
Closes apache#42462 from sandip-db/xml-file-format-master.
Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
common/utils/src/main/resources/error/error-classes.json (+5)
@@ -579,6 +579,11 @@
       "<errors>"
     ]
   },
+  "INVALID_XML_MAP_KEY_TYPE" : {
+    "message" : [
+      "Input schema <schema> can only contain STRING as a key type for a MAP."
+    ]
+  },
   "IN_SUBQUERY_DATA_TYPE_MISMATCH" : {
     "message" : [
       "The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery. Mismatched columns: [<mismatchedColumns>], left side: [<leftType>], right side: [<rightType>]."
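
The `INVALID_XML_MAP_KEY_TYPE` message added above uses a `<schema>` placeholder that is filled in when the error is raised. As a rough sketch of the kind of check that could produce it, the Python below walks a schema and rejects any MAP whose key type is not STRING. The tuple-based schema encoding and both function names are hypothetical, and whether the real check recurses into nested types is not shown in this diff; the sketch assumes it does.

```python
# Message text copied from error-classes.json; <schema> is the placeholder.
ERROR_TEMPLATE = "Input schema <schema> can only contain STRING as a key type for a MAP."


def has_non_string_map_key(dtype):
    """Return True if any MAP in `dtype` has a key type other than "string".

    Hypothetical encoding: primitives are strings ("string", "int", ...);
    complex types are tuples: ("map", key, value), ("array", element),
    ("struct", {field_name: field_type}).
    """
    if isinstance(dtype, str):
        return False
    kind = dtype[0]
    if kind == "map":
        _, key, value = dtype
        return key != "string" or has_non_string_map_key(value)
    if kind == "array":
        return has_non_string_map_key(dtype[1])
    if kind == "struct":
        return any(has_non_string_map_key(t) for t in dtype[1].values())
    raise ValueError(f"Unknown type constructor: {kind!r}")


def validate_xml_schema(schema, schema_text):
    """Raise TypeError with the filled-in message if the schema is rejected."""
    if has_non_string_map_key(schema):
        raise TypeError(ERROR_TEMPLATE.replace("<schema>", schema_text))
```

For example, `validate_xml_schema(("struct", {"attrs": ("map", "int", "string")}), '"STRUCT<attrs: MAP<INT, STRING>>"')` raises, while the same schema with a `"string"` key passes silently.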
docs/sql-error-conditions-datatype-mismatch-error-class.md (+4)
@@ -123,6 +123,10 @@ The `<functionName>` does not support ordering on type `<dataType>`.

 `<errors>`

+## INVALID_XML_MAP_KEY_TYPE
+
+Input schema `<schema>` can only contain STRING as a key type for a MAP.
+
 ## IN_SUBQUERY_DATA_TYPE_MISMATCH

 The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery. Mismatched columns: [`<mismatchedColumns>`], left side: [`<leftType>`], right side: [`<rightType>`].