Refactor statusapiv1 to trait and implement for ease of creation of these objects when we implement our own parser #248
Conversation
…or uptime is allocated resources and task time is used resources
…efactor
# Conflicts:
#	app/com/linkedin/drelephant/spark/fetchers/SparkFetcher.scala
…hese objects when we implement our own parser
…ndle dynamic allocation" This reverts commit 9d550f4.
Hi @shankar37, I think the PR is missing a commit renaming the API classes to *Impl. The PR also seems to be missing a bit of context. Do you plan to make the JSON parsing streaming?
@shankar37 From this PR it is difficult to understand the motivation for the change, and therefore it is difficult to review. Can you please share the design of the event log parsing code?
Here is, hopefully, the context. Currently, event log parsing happens in two ways. For the SparkFetcher, it streams the JSON event log and looks only for the EnvironmentUpdate event. This code is in SparkLogClient.scala; it does not use Spark's ReplayBus and relies only on Spark's public APIs. In addition, the SparkFetcher uses the SparkRestClient to get the rest of the data from the REST APIs and deserializes it directly into the statusapiv1 objects. The other way is in SparkFSFetcher, which uses the ReplayBus and Spark listeners to parse the event log, and then uses the LegacyDataConverter to convert the data it read into statusapiv1.

There are a couple of problems with this. First, it doesn't parse all the data from the event log, and some of the data we might require is not present. Second, in light of SPARK-18085 and this commit (apache/spark@561e9cc), the ReplayBus and these listeners are deprecated. Hence, we need to change the event log parsing for the FS fetcher to use our own parser.

What I am planning to do is write a generic EventLogParser that takes a stream and parses the event log JSON. It will parse it the way SparkLogClient does, but for all events we care about, and will produce the statusapiv1 data in its entirety. Both SparkFSFetcher and SparkLogClient will then be made to use it. It will take boolean flags indicating which parts of the data need to be parsed, to avoid parsing data the client doesn't need. To do that, the EventLogParser needs to read one event at a time, store some intermediate data, and convert it into the statusapiv1 structures. I am trying to avoid creating a lot of intermediate data and doing a field-by-field copy into statusapiv1, as the LegacyDataConverter does. Instead, I want to create the statusapiv1 Impl objects, fill in and modify the data as I parse and calculate it, and then just return them as the statusapiv1 traits.

This PR is the first step towards that. It makes SparkApplicationData contain only traits, so readers (heuristics and metricsAggregator) continue to get read-only data, while writers can create the Impl objects and write to individual fields. The problem with a class that has only vals is that it can only be created once all the fields are available, and that is difficult to do when you are parsing event logs one line at a time.
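A minimal sketch of that shape, assuming hypothetical names (EventLogParser, ApplicationInfo, and ApplicationInfoImpl are trimmed-down stand-ins for the real statusapiv1 types) and json4s for JSON parsing; the event field names follow what Spark's JsonProtocol writes to the event log:

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}

import org.json4s._
import org.json4s.jackson.JsonMethods.{parse => parseJson}

// Hypothetical, trimmed-down versions of the statusapiv1 types.
trait ApplicationInfo {
  def name: String
  def startTime: Long
  def endTime: Long
}

// Mutable implementation that the parser fills in incrementally.
class ApplicationInfoImpl(
  var name: String = "",
  var startTime: Long = -1L,
  var endTime: Long = -1L
) extends ApplicationInfo

class EventLogParser {
  private implicit val formats: Formats = DefaultFormats

  // Reads the event log one JSON line at a time, updates the Impl object
  // for the events of interest, and returns it typed as the read-only trait.
  def parse(in: InputStream): ApplicationInfo = {
    val appInfo = new ApplicationInfoImpl()
    val reader = new BufferedReader(new InputStreamReader(in))
    Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach { line =>
      val json = parseJson(line)
      (json \ "Event").extract[String] match {
        case "SparkListenerApplicationStart" =>
          appInfo.name = (json \ "App Name").extract[String]
          appInfo.startTime = (json \ "Timestamp").extract[Long]
        case "SparkListenerApplicationEnd" =>
          appInfo.endTime = (json \ "Timestamp").extract[Long]
        case _ => // skip events the caller did not ask for
      }
    }
    appInfo
  }
}
```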
Thanks @shankar37 for providing this detail.
I am planning to move that code into the new event log parser and expand on it. I will continue to rely on SparkListenerEvent and its derived classes, as I didn't find any good way to remove that dependency. Do you have any ideas on how to avoid depending on it?
I don't have any ideas either on how to remove these dependencies. Maybe we should think about it when you submit the PR. As of now, I think the goal should be to have the minimum dependency on Spark.
This change LGTM.
- val numCompletedStages: Int,
- val numSkippedStages: Int,
- val numFailedStages: Int)
+trait ApplicationInfo {
@shankar37 Why do we have these traits? Why can't we just have simple classes with var arguments?
These traits are returned as part of the HadoopApplicationData interface. If we just used classes with vars, readers could potentially change the values, and the data would no longer be read-only. So, to keep the interface clean, we need the traits.
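A minimal sketch of that read-only contract, with hypothetical, trimmed-down fields: the writer holds the Impl and mutates it freely, while readers receive only the trait, which exposes getters alone, so writes through it do not compile.

```scala
// Sketch only; the real statusapiv1 traits carry many more fields.
trait JobData {
  def numCompletedStages: Int
  def numFailedStages: Int
}

class JobDataImpl(
  var numCompletedStages: Int = 0,
  var numFailedStages: Int = 0
) extends JobData

object ReadOnlyDemo extends App {
  val impl = new JobDataImpl()
  impl.numCompletedStages = 42      // the parser (writer) mutates freely

  val view: JobData = impl          // heuristics receive only the trait
  println(view.numCompletedStages)  // prints 42
  // view.numCompletedStages = 0    // does not compile: no setter on the trait
}
```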
Shekhar,
I am planning to merge this so that the next PR is a clean diff of the parsing changes alone. Let me know by EOD if you have any further comments.
Thanks,
Shankar
@shankar37 LGTM.
…hese objects when we implement our own parser (linkedin#248)
This changes the statusapiv1 classes into traits and creates Impl classes for them. This change in itself seems to achieve nothing, but it is in preparation for rewriting the event log parsing code to use our own custom listeners and to create the objects from them. The previous classes, with only vals, were hard to create because all the data had to be available before the objects could be constructed. The new Impl classes have var members, which make it easy to set data incrementally.
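A compact, hypothetical illustration of the construction problem described here: an all-val class must be built in one shot, while the var-based Impl can be created empty and populated as events are replayed.

```scala
// Old shape: every value must be known before the object can exist.
class StageDataVal(val numCompletedTasks: Int, val numFailedTasks: Int)

// New shape: a read-only trait plus a mutable Impl.
trait StageData {
  def numCompletedTasks: Int
  def numFailedTasks: Int
}

class StageDataImpl(
  var numCompletedTasks: Int = 0,
  var numFailedTasks: Int = 0
) extends StageData

object ConstructionDemo extends App {
  val stage = new StageDataImpl()   // created before any task event is seen
  stage.numCompletedTasks += 1      // updated per parsed task-end event
  stage.numFailedTasks += 1         // updated when a failure event appears
  val readOnly: StageData = stage   // handed to heuristics as the trait
  println(s"${readOnly.numCompletedTasks} completed, ${readOnly.numFailedTasks} failed")
}
```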