Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The header of group by is incorrect #538

Closed
xxchan opened this issue Mar 1, 2022 · 6 comments · Fixed by #624
Closed

The header of group by is incorrect #538

xxchan opened this issue Mar 1, 2022 · 6 comments · Fixed by #624

Comments

@xxchan
Copy link
Member

xxchan commented Mar 1, 2022

The header is incorrect

> SELECT c0+1, c0+1+count(*) FROM t1 GROUP BY c0+1; 
+----+----------+
| c0 | ?column? |
+----+----------+
| 4  | 5        |
| 2  | 3        |
+----+----------+
in 0.004s

Originally posted by @xxchan in #535 (comment)

@skyzh
Copy link
Member

skyzh commented Apr 1, 2022

Now it will panic with:

SELECT c0+1, c0+1+count(*) FROM t1 GROUP BY c0+1;
thread 'tokio-runtime-worker' panicked at 'Column #ColumnRefId { database_id: 0, schema_id: 0, table_id: 11, column_id: 0 } should not be evaluated in `eval_array`', src/executor/evaluator.rs:55:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'failed to join query thread: JoinError::Panic(...)', src/main.rs:126:23

🤪

@skyzh
Copy link
Member

skyzh commented Apr 1, 2022

Despite the above issue, let's fix an easier one.

Case 1

create table t1 (c1 int);
insert into t1 values (1), (2), (3);
select c1 + 1 from t1 group by c1 + 1;

which will produce:

+----+
| c1 |
+----+
| 3  |
| 4  |
| 2  |
+----+

c1 should be c1 + 1 instead.

Case 2

create table t2 (c1 int, c2 int);
insert into t2 values (0, 1), (1, 2);
select c1 + c2 from t2 group by c1 + c2;

which yields:

+----+
| c1 |
+----+
| 1  |
| 3  |
+----+

(should be c1 + c2 instead)

Basically, this is a problem with schema. The derived schema is incorrect for logical aggregate.

.map(|expr| child_schema[expr.index].clone())

L78 directly used child schema, which is c1. But here we should have c1 + 1 / c1 + c2 as column desc instead. Also, in case of c1 + c2, the schema is totally incorrect. The derived schema for select c1 + c2 is:

[ColumnDesc { datatype: Int(None) (null), name: "c1", is_primary: false }, ColumnDesc { datatype: Int(None) (null), name: "c2", is_primary: false }]

while this executor will only output one column.

So we will need to rewrite the whole scheme derivation process. It should be as simple as:

        self.group_keys
            .iter().map(|expr| ColumnDesc { /* how to fill the inner struct? */ }).chain(/* agg call schema */)

Where we need to derive a schema for arbitrary expression. We can simply use format!("{:?}", expr) as column name, expr.return_type as data type, and set is_primary = false. We now have these three properties, and we can easily compose a ColumnDesc.

@skyzh
Copy link
Member

skyzh commented Apr 1, 2022

cc @shmiwy would you please have a try?

@shmiwy
Copy link
Contributor

shmiwy commented Apr 13, 2022

It seems that we can't compose a ColumnDesc simply use format!("{:?}", expr) as column name, expr.return_type as data type, and set is_primary = false, because we can use multi filed when we group by.

create table t(c1 int, c2 int, c3 int, c4 int);
insert into t values (0,1,2,3), (1,2,3,4), (2,3,4,5);
select c1 + c2, c3 + c4 from t group by c1 + c2, c3 + c4;

the logical_plan created by this line

let logical_plan = input_ref_resolver.rewrite(logical_plan);

shows like this

project_expressions: [
        InputRef #0,
        InputRef #1,
    ],
    child: LogicalAggregate {
        agg_calls: [],
        group_keys: [
            Plus(InputRef #0, InputRef #1),
            Plus(InputRef #2, InputRef #3),
        ],
       ........................

when we get column_names, column_names[0] should be c1 + c2, column_names[1] == c3 + c4
somaybe we should change this line

.map(|expr| child_schema[expr.index].clone())

to corresponding form(maybe we can get form from self.group_keys)

@skyzh
Copy link
Member

skyzh commented Apr 13, 2022

to corresponding form(maybe we can get form from self.group_keys)

Yes, exactly! You can find out how to generate c1 + c2 from the expression input ref #0 + input ref #1.

@skyzh
Copy link
Member

skyzh commented Apr 13, 2022

For the first step, we can keep it simple -- if the group_keys is not a simple InputRef, we simply use Plus(InputRef #0, InputRef #1) as the column name. In the future, we can add a function like format_with_column_desc(&self, column_desc: Vec<ColumnDesc>) on BoundExpr, so that we can resolve the full name.

skyzh pushed a commit that referenced this issue Apr 17, 2022
* executor: use DataChunkbuilder to refactor ValuesExecutor (#623)

* use DataChunkbuilder to refactor ValuesExecutor
Signed-off-by: Jayice <1185430411@qq.com>

* code format
Signed-off-by: Jayice <1185430411@qq.com>

* code style
Signed-off-by: Jayice <1185430411@qq.com>
Signed-off-by: Shmiwy <wyf000219@126.com>

* fix: make header correct when using group by(#538)

Signed-off-by: Shmiwy <wyf000219@126.com>

* fix: map group_keys to ColumnDesc

Signed-off-by: Shmiwy wyf000219@gamil.com
Signed-off-by: Shmiwy <wyf000219@126.com>

* fix: use expr's return type instead of child schema's

Signed-off-by: Shmiwy wyf000219@gamil.com
Signed-off-by: Shmiwy <wyf000219@126.com>

* build: bump toolchain and deps, remove unused dep (#629)

Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: Shmiwy <wyf000219@126.com>

Co-authored-by: JAYICE <49588871+JayiceZ@users.noreply.github.com>
Co-authored-by: TennyZhuang <zty0826@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants