-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Fix JSON orient='table' issues with numeric column names #25488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #25488 +/- ##
==========================================
+ Coverage 91.75% 91.75% +<.01%
==========================================
Files 173 173
Lines 52955 52960 +5
==========================================
+ Hits 48588 48594 +6
+ Misses 4367 4366 -1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25488 +/- ##
==========================================
- Coverage 92% 91.99% -0.01%
==========================================
Files 175 175
Lines 52371 52381 +10
==========================================
+ Hits 48184 48190 +6
- Misses 4187 4191 +4
Continue to review full report at Codecov.
|
@@ -1204,9 +1204,10 @@ def test_data_frame_size_after_to_json(self): | |||
|
|||
@pytest.mark.parametrize('index', [None, [1, 2], [1., 2.], ['a', 'b'], | |||
['1', '2'], ['1.', '2.']]) | |||
@pytest.mark.parametrize('columns', [['a', 'b'], ['1', '2'], ['1.', '2.']]) | |||
@pytest.mark.parametrize('columns', [None, [1, 2], [1., 2.], ['a', 'b'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I don't know that we want to do this. Is it valid JSON in the table spec to have column names that are non-string?
Understood you have gotten this to round trip but if it violates the Table spec for JSON then I'd rather raise as commented previously
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd as I already commented in #19129, after having tried to start a discussion before doing the PR:
I was suggesting not just "raising a more descriptive ValueError" (sic), but changing the implementation of the JSON serialization for orient='table'.
Could you please tell me where the JSON table spec claims that column names MUST be strings?
I am going to make a longer comment to further justify my PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please tell me where the JSON table spec claims that column names MUST be strings?
Not specific to the table spec as much as just JSON itself. See the description of an object here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd you are talking about different things:
- JSON spec imposes that in a JSON object (composed of key-value pairs), its keys must be strings
- but we are talking about column names, not JSON object keys
And my question is: where the JSON table spec claims that COLUMN NAMES (not JSON object keys) must be strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pwalsh can you comment on this? Is the name
field in a Field Descriptor expected to be a string?
Hello, here my explanation to justify this PR. As I already commented in #19129, IIUC, the only aim of orient='target' is to get round-trip JSON serialization/deserialization of pandas objects. Thus, the 'schema' was added following the "Table Schema" spec in https://frictionlessdata.io/specs/table-schema/. This spec is ONLY for the 'schema', not for the table itself. And this spec does not prohibit the column names to be non-strings. Indeed, when serializing pandas DataFrames with non-string column names, their 'schema' is totally valid, and the issues arise just because of the JSON serialization of the table itself (the 'data'). In order to serialize the data itself, among all the possible orient values, it was chosen orient='records' and this is the cause of all the issues. Why? Because orient='records' serializes each DataFrame row as a JSON object (roughly equivalent to a Python dict), using the column names as the JSON object names (dict keys). The choice of using column names as dict keys (JSON object names) is not right for several reasons:
In order to address all theses issues, this PR proposes to use another representation of each DataFrame row: instead of using a Python dict, we should use a Python list. The Python list:
And in pandas, this can be easily made by choosing the orient='values' (instead of orient='records'). Finally, I would like to point out that there exists a Tabular Data Resource spec (https://frictionlessdata.io/specs/tabular-data-resource/) and it clearly establishes (in its section JSON Tabular Data):
This PR proposes to use the first option (instead of the second one) for all the previously mentioned reasons. |
OK I see your point here though so thanks for clarifying. Will review in more detail later and give you feedback just wanted to reach out in the interim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not averse to doing this, but this completely breaks backward compatibility. So. update the version on the table. Also I think need to save some samples using the original format and add some tests to be able to read them (dispatch on the version).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't just change the output like this without a deprecation.
In retrospect, it seems like we overloaded the orient
keyword to handle two things
- whether or not we include a JSON table schema
- the orient of the data within the output
What do people think about leaving orient
just for the data key, and adding a new keyword schema=True/False
to control whether we include the schema
key at the outer level? I haven't really thought through the deprecation cycle.
What do you see as the advantage of doing this? I would actually be against this as we already have quite a few JSON formats, so adding another that doesn't follow the table spec doesn't add a lot of value I think. |
I’ll need to read the spec again, but it may not make any claims for how the data should be represented.
… On Mar 4, 2019, at 22:25, William Ayd ***@***.***> wrote:
What do people think about leaving orient just for the data key, and adding a new keyword schema=True/False to control whether we include the schema key at the outer level? I haven't really thought through the deprecation cycle.
What do you see as the advantage of doing this? I would actually be against this as we already have quite a few JSON formats, so adding another that doesn't follow the table spec doesn't add a lot of value I think.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
can you merge master |
1 similar comment
can you merge master |
Not a bad idea but this change is definitely breaking and comes with quite a few implications. Given it's gone somewhat stale over the past month and a half going to close out, but feel free to ping back @albertvillanova if it's something you want to drive forward |
closes #19129
closes #22525
git diff upstream/master -u -- "*.py" | flake8 --diff