%% This is file `elsarticle-template-2-harv.tex',
%%
%% Copyright 2009 Elsevier Ltd
%%
%% This file is part of the 'Elsarticle Bundle'.
%% ---------------------------------------------
%%
%% It may be distributed under the conditions of the LaTeX Project Public
%% License, either version 1.2 of this license or (at your option) any
%% later version. The latest version of this license is in
%% http://www.latex-project.org/lppl.txt
%% and version 1.2 or later is part of all distributions of LaTeX
%% version 1999/12/01 or later.
%%
%% The list of all files belonging to the 'Elsarticle Bundle' is
%% given in the file `manifest.txt'.
%%
%% Template article for Elsevier's document class `elsarticle'
%% with harvard style bibliographic references
%%
%% $Id: elsarticle-template-2-harv.tex 155 2009-10-08 05:35:05Z rishi $
%% $URL: http://lenova.river-valley.com/svn/elsbst/trunk/elsarticle-template-2-harv.tex $
%%
%%\documentclass[preprint,authoryear,12pt]{elsarticle}
%% Use the option review to obtain double line spacing
%% \documentclass[authoryear,preprint,review,12pt]{elsarticle}
%% Use the options 1p,twocolumn; 3p; 3p,twocolumn; 5p; or 5p,twocolumn
%% for a journal layout:
%% Astronomy & Computing uses 5p
%% \documentclass[final,authoryear,5p,times]{elsarticle}
\documentclass[final,authoryear,5p,times,twocolumn]{elsarticle}
%% if you use PostScript figures in your article
%% use the graphics package for simple commands
%% \usepackage{graphics}
%% or use the graphicx package for more complicated commands
% \usepackage{graphicx}
%% or use the epsfig package if you prefer to use the old commands
%% \usepackage{epsfig}
%% The amssymb package provides various useful mathematical symbols
% \usepackage{amssymb}
\usepackage{gensymb}
%% The amsthm package provides extended theorem environments
%% \usepackage{amsthm}
\usepackage[pdftex,pdfpagemode={UseOutlines},bookmarks,bookmarksopen,colorlinks,linkcolor={blue},citecolor={green},urlcolor={red}]{hyperref}
%% Draft mode for editing
%\usepackage[pdftex,pdfpagemode={UseOutlines},bookmarks,bookmarksopen,colorlinks,linkcolor={blue},citecolor={green},urlcolor={red},draft]{hyperref}
% \usepackage{hypernat}
\usepackage{breakurl}
\usepackage{listings}
\lstset{
language=C,
basicstyle=\ttfamily,
showstringspaces=false,
stringstyle=\color{red},
morecomment=[l]{/},
commentstyle=\color{blue},
}
%% The lineno packages adds line numbers. Start line numbering with
%% \begin{linenumbers}, end it with \end{linenumbers}. Or switch it on
%% for the whole article with \linenumbers after \end{frontmatter}.
%% \usepackage{lineno}
%% natbib.sty is loaded by default. However, natbib options can be
%% provided with \biboptions{...} command. Following options are
%% valid:
%% round - round parentheses are used (default)
%% square - square brackets are used [option]
%% curly - curly braces are used {option}
%% angle - angle brackets are used <option>
%% semicolon - multiple citations separated by semi-colon (default)
%% colon - same as semicolon, an earlier confusion
%% comma - separated by comma
%% authoryear - selects author-year citations (default)
%% numbers- selects numerical citations
%% super - numerical citations as superscripts
%% sort - sorts multiple citations according to order in ref. list
%% sort&compress - like sort, but also compresses numerical citations
%% compress - compresses without sorting
%% longnamesfirst - makes first citation full author list
%%
%% \biboptions{longnamesfirst,comma}
% \biboptions{}
\journal{Astronomy \& Computing}
%% Upright single quotes in verbatim fields make FITS header examples
%% much more readable
\usepackage{upquote}
%% For draft use color package to indicate open questions that need
%% clarification
\usepackage{color}
%\defcitealias{2015Jenness}{Paper~II}
%\defcitealias{2014Kitaeff}{Paper~III}
\begin{document}
\begin{frontmatter}
%% Title, authors and addresses
%% use the tnoteref command within \title for footnotes;
%% use the tnotetext command for the associated footnote;
%% use the fnref command within \author or \address for footnotes;
%% use the fntext command for the associated footnote;
%% use the corref command within \author for corresponding author footnotes;
%% use the cortext command for the associated footnote;
%% use the ead command for the email address,
%% and the form \ead[url] for the home page:
%%
%% \title{Title\tnoteref{label1}}
%% \tnotetext[label1]{}
%% \author{Name\corref{cor1}\fnref{label2}}
%% \ead{email address}
%% \ead[url]{home page}
%% \fntext[label2]{}
%% \cortext[cor1]{}
%% \address{Address\fnref{label3}}
%% \fntext[label3]{}
\title{Learning from FITS: Limitations in use in modern astronomical research}
%% use optional labels to link authors explicitly to addresses:
%% \author[label1,label2]{<author name>}
%% \address[label1]{<address>}
%% \address[label2]{<address>}
\author[noao]{Brian~Thomas\corref{cor1}}
\ead{bthomas@noao.edu}
\author[cornell]{Tim~Jenness}
\author[noao]{Frossie~Economou}
\author[stsci]{Perry~Greenfield}
\author[geminin]{Paul~Hirst}
\author[jac]{David~S.~Berry}
\author[stsci]{Erik~Bray}
\author[glasgow]{Norman~Gray}
\author[ohio]{Demitri~Muna}
\author[geminis]{James~Turner}
\author[princeton]{Miguel~de~Val-Borro}
\author[iaa,ska]{Juande~Santander-Vela}
\author[ipac]{David~Shupe}
\author[ipac]{John~Good}
\author[ipac]{G.~Bruce~Berriman}
\author[icrar]{Slava~Kitaeff}
\author[microsoft]{Jonathan~Fay}
\author[sao]{Omar~Laurino}
\author[stsci]{Anastasia~Alexov}
\author[ipac]{Walter~Landry}
\author[nrao]{Joe~Masters}
\author[cornell]{Adam~Brazier}
\author[aifa]{Reinhold~Schaaf}
\author[uwaterloo]{Kevin~Edwards}
\author[jac]{Russell~O.~Redman}
\author[warwick]{Thomas~R.~Marsh}
\author[aip]{Ole~Streicher}
\author[noao]{Pat~Norris}
\author[ucm]{Sergio~Pascual}
\author[unsw]{Matthew~Davie}
\author[stsci]{Michael~Droettboom}
\author[mpia]{Thomas~Robitaille}
\author[iasf]{Riccardo~Campana}
\author[psu]{Alex~Hagen}
\author[mps]{Paul~Hartogh}
\author[aifa]{Dominik~Klaes}
\author[msum]{Matthew~W.~Craig}
\author[cral]{Derek~Homeier}
\cortext[cor1]{Corresponding author}
\address[noao]{Science Data Management, National Optical Astronomy Observatory, 950 N Cherry Ave, Tucson, AZ 85719, USA}
\address[cornell]{Department of Astronomy, Cornell University, Ithaca,
NY 14853, USA}
\address[stsci]{Space Telescope Science Institute, 3700 San Martin Drive, Baltimore, MD 21218, USA}
\address[geminin]{Gemini Observatory, 670 N.\ A`oh\=ok\=u Place, Hilo, HI 96720, USA}
\address[jac]{Joint Astronomy Centre, 660 N.\ A`oh\=ok\=u Place, Hilo, HI 96720, USA}
\address[glasgow]{SUPA School of Physics \& Astronomy, University of Glasgow, Glasgow, G12 8QQ, UK}
\address[ohio]{Department of Astronomy, The Ohio State University, Columbus, OH 43210, USA}
\address[geminis]{Gemini Observatory, Casilla 603, La Serena, Chile}
\address[princeton]{Department of Astrophysical Sciences, Princeton University, Princeton, NJ 08544, USA}
\address[iaa]{Instituto de Astrof\'isica de Andaluc\'ia, Glorieta de la Astronom\'ia s/n, E-18008, Granada, Spain}
\address[ska]{Square Kilometre Array Organisation, Jodrell Bank Observatory, Lower Withington, Macclesfield SK11~9DL, UK}
\address[ipac]{Infrared Processing and Analysis Center, Caltech, Pasadena, CA 91125, USA}
\address[icrar]{International Centre for Radio Astronomy Research, M468, 35 Stirling Hwy, Crawley, Perth WA 6009, Australia}
\address[microsoft]{Microsoft Research, 14820 NE 36th Street, Redmond, WA 98052, USA}
\address[sao]{Smithsonian Astrophysical Observatory, 60 Garden Street, Cambridge,
MA 02138, USA}
\address[nrao]{National Radio Astronomy Observatory, 520 Edgemont Road,
Charlottesville, VA 22903, USA}
\address[aifa]{Argelander-Institut f\"{u}r Astronomie, Universit\"{a}t Bonn, Auf dem H\"{u}gel 71, 53121 Bonn, Germany}
\address[uwaterloo]{Department of Physics, University of Waterloo, Waterloo, ON N2L~3G1, Canada}
\address[warwick]{Department of Physics, University of Warwick, Coventry CV4 7AL, UK}
\address[aip]{Leibniz-Institut f\"{u}r Astrophysik Potsdam (AIP), An der Sternwarte 16, 14482 Potsdam, Germany}
\address[ucm]{Departamento de Astrof\'{i}sica, Universidad Complutense de Madrid, 28040, Madrid, Spain}
\address[unsw]{Department of Astrophysics, School of Physics,
University of New South Wales, Sydney, NSW 2052, Australia}
\address[mpia]{Max-Planck-Institut f\"{u}r Astronomie, K\"{o}nigstuhl 17, 69117 Heidelberg, Germany}
\address[iasf]{Institute for Space Astrophysics and Cosmic Physics, Via Piero Gobetti 101, Bologna, I-40129, Italy}
\address[psu]{Dept.\ of Astronomy and Astrophysics, The Pennsylvania
State University, 525 Davey Lab, University Park, PA 16802, USA}
\address[mps]{Max-Planck-Institut f\"{u}r Sonnensystemforschung,
Justus-von-Liebig-Weg 3, 37077 G\"{o}ttingen, Germany}
\address[msum]{Department of Physics and Astronomy, Minnesota State University Moorhead, 1104 7th Ave. S., Moorhead, MN 56563, USA}
\address[cral]{Centre de Recherche Astrophysique de Lyon, UMR 5574, CNRS,
Universit\'{e} de Lyon, ENS Lyon, %% \'{E}cole Normale Sup\'{e}rieure de Lyon,
46 All\'{e}e d'Italie, 69364 Lyon Cedex 07, France}
\begin{abstract}
%% Text of abstract
The Flexible Image Transport System (FITS) standard has been a great
boon to astronomy, allowing observatories, scientists and the public
to exchange astronomical information easily. The FITS standard,
however, is showing its age. When the format was developed in the late 1970s, its
authors made a number of implementation choices that, while common at
the time, are now seen to limit its utility with modern data. The
authors of the FITS standard could not anticipate the challenges which
we are facing today in astronomical computing. Difficulties we now
face include, but are not limited to, addressing the need to
handle an expanded range of specialized data product types (data
models), being more conducive to the networked exchange and storage of
data, handling very large datasets, and capturing
significantly more complex metadata and data relationships.
There are members of the community today who find some or all of
these limitations unworkable, and have decided to move ahead with
storing data in other formats. If this fragmentation continues, we
risk abandoning the advantages of broad interoperability, and ready
archivability, that the FITS format provides for astronomy.
In this paper we detail some
selected important problems which exist within the FITS standard
today. These problems may provide insight into deeper underlying
issues which reside in the format and we provide a discussion of
some lessons learned. It is not our intention here to prescribe specific remedies to
these issues; rather, it is to call the attention of the FITS and
wider astronomical computing communities to these problems in the
hope that this will spur action to address them.
\end{abstract}
\begin{keyword}
%% keywords here, in the form: keyword \sep keyword
%% MSC codes here, in the form: \MSC code \sep code
%% or \MSC[2008] code \sep code (2000 is the default)
FITS \sep
File formats \sep
Standards
\end{keyword}
\end{frontmatter}
% \linenumbers
\newcommand{\ascl}[1]{\href{http://www.ascl.net/#1}{ascl:#1}}
\newcommand{\aspconf}{ASP Conf.\ Ser}
\newcommand{\aap}{A\&A}
\newcommand{\aaps}{A\&AS}
\newcommand{\jrasc}{JRASC}
\newcommand{\qjras}{QJRAS}
\newcommand{\mnras}{MNRAS}
\newcommand{\pasp}{PASP}
\newcommand{\pasa}{PASA}
\newcommand{\apjs}{ApJS}
%% main text
\section{Introduction}
The Flexible Image Transport System standard (FITS;
\citealt{1979ipia.coll..445W,1980SPIE..264..298G,1981A&AS...44..363W,1981A&AS...44..371G} and
\citealt{2001A&A...376..359H}; and more recently, the definition of the
version 3.0 FITS standard by \citealt{2010A&A...524A..42P}) has been a
fundamental part of astronomical computing for a significant part of the
past four decades. The FITS format became the central means to store and
exchange astronomical data, and because of hard work by the FITS
community it has become a relatively easy exercise for application
writers, archivists, and end user scientists to interchange data and
work productively on many computational astronomy problems. The success
of FITS is such that it has even spread to other domains such as medical
imaging and digitizing manuscripts in the Vatican Library
\citep{2006JRASC.100..242W,2012EWASSAlle}.
Although there have been some significant changes, the FITS standard
has evolved very slowly since its genesis in the late 1970s. New types
of metadata conventions such as World Coordinate System
\citep[WCS;][]{2002A&A...395.1061G,2002A&A...395.1077C,2006A&A...446..747G}
representation and data serializations such as variable length binary
tables \citep{1995A&AS..113..159C} have been added. Nevertheless,
these changes have not been sufficient to match the greater evolution
in astronomical research over the same period of time.
Astronomical research now goes beyond the paradigm of a set of
observational data being analyzed only by the scientific team who
proposed or collected it. The community routinely combines original
observations, theoretical calculations, observations from others, and
data from archives on the internet in order to obtain better and wider
ranging scientific results. A wide variety of research projects now involve many
diverse datasets from a range of sources. Instruments in astronomy
now produce datasets several orders of magnitude larger than were common
at the time FITS was born, in some cases requiring parallelized,
distributed storage systems to provide adequate data rates
\citep{2012ASPC..461..283A}.
Astronomers have increasingly come to rely on others to write software
programs to help process and analyze their data. Common libraries, analysis
environments, pipeline processed data, applications and services
provided by third parties form a crucial foundation for many
astronomers' toolboxes. All of this requires that the interchange of
data between different tools be as automated as possible, and
that complex data models and metadata used in processing are
maintained and understood through the interchange.
These changes in research practices pose new challenges for the
21\textsuperscript{st} century. We must address the need to handle an
expanded range of specialized data product types and models, be more
conducive to the distributed exchange and storage of data, handle very
large datasets and provide a means to capture significantly more complex
metadata and data relationships.
A summary of these significant problems within the FITS standard was
presented in \citet{P90_adassxxiii}. Already some of these limitations
have caused members of the community to seek more capable storage
formats, both in the past, such as the Starlink Hierarchical Data System
\citep[HDS;][]{1982QJRAS..23..485D,2015HDS}, the eXtensible Data Format
\citep[XDF;][]{2001ASPC..238..217S}, FITSML \citep{2001ASPC..238..487T}
and HDX \citep{2003ASPC..295..221G}; and in the present and future
(e.g., HDF5 \citep{2011ASPC..442...53A} and NDF \citep[][\ascl{1411.023}]{2015Jenness}).
There are other popular file formats among the
radio and (sub-)millimeter astronomy community such as the Continuum and
Line Analysis Single-dish Software (CLASS) data format associated with
the Grenoble Image and Line Data Analysis Software (GILDAS) tools
(\ascl{1305.010}). Although this file
format does not have a public specification, there are open-source
spectroscopic software packages like \texttt{PySpecKit}
(\ascl{1109.001}) that support certain
versions of the data format. Given the large number of available
storage formats, there is certainly a possibility that the use of FITS
will decline in favor of other scientific data formats should it not adapt
to these new challenges.
The strengths of FITS are well known and include an easily understood
serialization, a plethora of stable supporting software, good documentation of
the format and the simple fact that it remains to this day the \emph{lingua
franca} of astronomical data format exchange. What we feel has been missing
is an attempt within the community to
dispassionately discuss and understand FITS in terms of problems in
its application to modern astronomical research. In this paper we hope to show
that technologies and research techniques in astronomy have evolved but FITS has
not kept pace. As a result, gaps between FITS utility and the needs of the
research community have opened up and widened over time.
Our goal in this paper is to highlight some selected, important
problems which exist in the FITS core standard today. We have deliberately
avoided proposing solutions to the problems we discuss, and we remain agnostic
(because the authors are divided) on whether replacing FITS is an obviously good
or an obviously bad idea.
We present our argument in the following manner. The various issues have been grouped
under the general topics ``information interchange'' (section~\ref{section_poor_exchange}),
``data models'' (section~\ref{section_crit_data_models}),
``metadata and data representation'' (section~\ref{section_inflex_represent}) and
``large and/or distributed datasets'' (section~\ref{section_poor_large_data_support}).
We address each of these topics in turn below, then try to provide an
analysis of any deeper causes or ``lessons learned'' (section~\ref{sec:discussion}).
A summary section ends the paper and provides an overview of our work and future
direction.
\section{Information Interchange}
\label{section_poor_exchange}
FITS originated as a delivery format for observatory data. It was the format
of choice when transporting data between different data reduction
environments such as IRAF (\ascl{9911.002}),
Starlink (\ascl{1110.012}), AIPS
(\ascl{9911.003}) and MIDAS
(\ascl{1302.017}).
In principle, FITS promotes interchange through its simple and easily
understood format which holds its information in various levels of groupings
of metadata and data blocks. Metadata are captured via key-value pairs which
are in turn grouped into FITS headers. The first header is denoted as the
`primary' header and subsequent headers are known as `extensions'. Headers may or
may not then be grouped with data blocks. An example primary FITS header
appears in Fig.~\ref{fig:fitshead}.
\begin{figure*}
\begin{minipage}{\textwidth}
\begin{lstlisting}
SIMPLE  =                    T / Standard FITS format
BITPIX  =                  -32 / 32 bit IEEE floating point numbers
NAXIS   =                    3 / Number of axes
NAXIS1  =                  800 /
NAXIS2  =                  800 /
NAXIS3  =                    4 /
EXTEND  =                    T / There may be standard extensions
ATODGAIN=             7.000000 / Analog to Digital Gain (Electrons/DN)
RNOISE  =             1.010153 / Readout Noise (DN)
EPOCH   =       49740.82869315 / exposure average time (Modified Julian Date)
EXPTIME =          2500.000000 / exposure duration (seconds)--calculated
EXP0    =          1300.000000 / weighted average initial exposure time
RSDPFILL=                 -250 / bad data fill value for calibrated images
SATURATE=                10237 / Data value at which saturation occurs
TEMP    =                    0 / Temperature (0=cold, 1=warm)
FILTNAM1= 'F555W   '           / first filter name
HSTPHOT =                    T / Preprocessed by HSTphot/mask
END
\end{lstlisting}
\caption{Representative simple primary header of a FITS file showing
an assortment of FITS keywords and their associated values. This
header from 1995 uses a definition of the, now deprecated,
\texttt{EPOCH} keyword that is at odds with the standard usage of
the period, but the lack of parsable units for the field makes it
hard for a computer parser to understand this.
Bytes which contain data may or may not follow the \texttt{END} keyword of
the header.}
\label{fig:fitshead}
\end{minipage}
\end{figure*}
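
The card layout seen in Fig.~\ref{fig:fitshead} can be illustrated with a
short Python sketch. It assumes the fixed layout of a valued card (keyword in
columns 1--8, the value indicator \texttt{=} and a space in columns 9--10, and
an optional comment after a \texttt{/}). The function names
\texttt{parse\_card} and \texttt{parse\_value} are ours, chosen purely for
illustration, and the sketch ignores continuation conventions and
Fortran-style \texttt{D} exponents.
\begin{lstlisting}[language=Python,basicstyle=\ttfamily\footnotesize,breaklines=true]
def parse_value(text):
    """Convert an unquoted FITS value to a Python object."""
    if text == "T":
        return True
    if text == "F":
        return False
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text  # unrecognized values are left as strings


def parse_card(card):
    """Split one header card into (keyword, value, comment)."""
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # No value indicator: END, COMMENT, HISTORY or blank.
        return keyword, None, card[8:].strip()
    body = card[10:].strip()
    if not body.startswith("'"):
        raw, _, comment = body.partition("/")
        return keyword, parse_value(raw.strip()), comment.strip()
    # Quoted string: find the closing quote, treating a doubled
    # quote ('') as one literal quote character.
    i = 1
    while i < len(body):
        if body[i] == "'":
            if body[i + 1:i + 2] != "'":
                break
            i += 1  # skip the second quote of the pair
        i += 1
    value = body[1:i].replace("''", "'").rstrip()
    _, _, comment = body[i + 1:].partition("/")
    return keyword, value, comment.strip()


cards = [
    "BITPIX  =                  -32 / 32 bit IEEE floating point numbers",
    "EPOCH   =       49740.82869315 / exposure average time (Modified Julian Date)",
    "FILTNAM1= 'F555W   '           / first filter name",
    "END",
]
for card in cards:
    print(parse_card(card.ljust(80)))
\end{lstlisting}
Such a parser recovers the \texttt{EPOCH} value as a plain number; the fact
that it is a Modified Julian Date is available only in the free-text comment.
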
This simple arrangement of information can satisfy many use cases for
transport; however, requirements for interchange have evolved. Effective
interchange, as we shall illustrate, now includes things like the ability to
declare models for use in higher level processing, validation of models within
the file and, at the most basic level, the ability to declare which version of
the serialization is being used.
These capabilities have been explored and implemented in several other data formats in astronomy.
%These needed capabilities have been noticed by others and, in fact, members
%of the astronomical community have designed formats to satisfy some
%of these requirements.
The Astronomical Data Center (ADC) XDF
format, the Low-Frequency Array for Radio Astronomy (LOFAR) HDF5 data
model \citep{2012ASPC..461..283A}, CASA measurement sets
\citep{2012ASPC..461..849P},
RPFITS\footnote{\url{http://www.atnf.csiro.au/computing/software/rpfits.html}
-- RPFITS is an incompatible fork of FITS \citep[see e.g.,][]{1998ASPC..145...32B}.} from the Australia Telescope
National Facility
and Starlink's NDF
\citep{1988STARB...2...11C,1993ASPC...52..229W,P91_adassxxiii} all
serve as examples in this regard.
XDF was created primarily to support archiving, web-based use of
published astronomical data and the development of FITSML -- an XML version
of the FITS data model which could use an XML schema for validation.
NDF was developed in the late
1980s as a means of organizing the hierarchical structures that were
available via the Starlink HDS format when it became apparent that
arbitrary hierarchies could lead to chaos and lack of ability for
applications to interoperate \citep{2015Jenness}.
HDX \citep{2003ASPC..295..221G} was developed around 2002 as a flexible
way of layering high-level data structures, presented as a virtual XML
Document Object Model (DOM), atop otherwise unstructured external data stores; this was in
turn used to develop Starlink's NDX framework, which (among other
things) allowed FITS files to be viewed and manipulated using the
concepts of the NDF format.
HDF5 \citep{2012ASPC..461..283A} was chosen to accommodate LOFAR's
exceptionally high data rates, 6-dimensional data complexity, distributed
data processing and I/O parallelization needs.
\subsection{Format versioning}
\label{subsection_format_versioning}
There is no standard means for a FITS file to communicate
the formatting version it conforms to. Consider the example primary
header in Fig.~\ref{fig:fitshead}: the only keyword which implies any
type of format is \texttt{SIMPLE}, which is set to `T', or true. The comment
indicates that the file conforms to ``Standard FITS format'', but what
indeed is that `Standard'?
The designers and maintainers of FITS have espoused the principle
``once FITS, forever FITS''
\citep[see e.g.,][]{1988A&AS...73..359G,1993FITS1}. Certainly some in the
community see this as a strength for the format as it appears to
promote long term stability and ``archivability'' of FITS data
\citep{2012EWASSAlle,2012LOC}. This is not, however, quite the same
thing as saying that FITS is unversioned. There have been at least
three named descriptions of FITS. These include the first, or `basic
FITS' document \citep{1979ipia.coll..445W,1981A&AS...44..363W}, the
NOST version of FITS \citep{2001A&A...376..359H}, and the current
version 3.0 \citep{2010A&A...524A..42P}. One can regard these as
successive improvements of a document describing changing best
practices for an unchanging format (compare ``the value [of the
putative FITS version keyword] is always 1.0 by default'' in
\citet{1997ASPC..125..257W}, which discusses this general point in
some depth). However, the fact remains that there are features in the
most recent FITS description (such as 64-bit integers, negative
\texttt{BITPIX} values, FITS extensions and tables) which were not
present in the first FITS version; FITS has demonstrably evolved.
The ``once FITS, forever FITS'' doctrine may be taken to require
backward and forward compatibility or, if you will, compatibility
with all FITS files ever created in the case where there is only
one version ever.
Either way, backward compatibility means that it always
\emph{should} be feasible to use the most recent FITS reader. For
forward compatibility, a reasonable minimum expectation goes
beyond requiring that a FITS reader not crash when confronted with a newer
FITS file. Ideally, it should parse
those parts of the file which are still compliant with
its understanding of the format and report on those parts/features
of the file which it does not recognize. In either compatibility case,
without unambiguous version metadata, readers have to rely on `duck-typing'%
\footnote{See \url{http://en.wikipedia.org/wiki/Duck\_typing}}
and heuristics, which are ultimately error prone because they require
the implementer of the parser to perfectly interpret the signature
of any particular set of features present in the given FITS instance
from among other possible features which are absent. Furthermore, as
the format evolves beyond the date of its creation, the software cannot
know how that signature may change and may incorrectly
identify the version, a clear difficulty for forward compatibility.
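
The kind of feature sniffing this forces on a reader might look like the
following Python sketch. The rules are illustrative assumptions built from the
features named above (negative \texttt{BITPIX} values, 64-bit integers,
extensions and tables); \texttt{infer\_features} is a hypothetical helper and
not part of any FITS library.
\begin{lstlisting}[language=Python,basicstyle=\ttfamily\footnotesize,breaklines=true]
def infer_features(header):
    """Guess which later FITS features a header relies on.

    `header` maps keyword -> parsed value. These rules are
    heuristics for illustration, not part of any standard.
    """
    features = set()
    bitpix = header.get("BITPIX")
    if isinstance(bitpix, int) and bitpix < 0:
        features.add("negative BITPIX")
    if bitpix == 64:
        features.add("64-bit integers")
    if header.get("EXTEND") is True or "XTENSION" in header:
        features.add("extensions")
    if header.get("XTENSION") in ("TABLE", "BINTABLE"):
        features.add("tables")
    return features


fig_header = {"SIMPLE": True, "BITPIX": -32, "EXTEND": True}
plain_header = {"SIMPLE": True, "BITPIX": 16, "NAXIS": 2}
print(infer_features(fig_header))
print(infer_features(plain_header))  # nothing detected
\end{lstlisting}
The second header triggers none of the rules, so the heuristic cannot say
which description of FITS the file was written against; the absence of a
feature is not evidence of an older version.
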
The reliance on heuristics also has impact beyond writing a FITS parser.
Future archivists will certainly want to know what version of the
format they are dealing with without having to guess from ancillary
evidence such as the presence of certain keywords, date of the file
creation and so on.
The lack of versioning also limits the ability of our community to
move forward constructively with developing new FITS versions.
The ``once FITS, forever FITS'' doctrine requires we accrete
more and more ``design rules'' which may limit our ability to implement
new and needed features and clutter reader code. Consider that three keywords
have been deprecated
(\texttt{BLOCKED}, \texttt{CROTA2} and \texttt{EPOCH}) by the latest version
of FITS. Per the standard, these are ``obsolete structures that should not be
used in new FITS files but which shall remain valid indefinitely''.
As such, software writers must indefinitely be on guard for these metadata
and writers of new conventions must avoid utilizing these specific keywords.
As time passes and changes of this nature accumulate, it will be progressively
harder to interpret FITS data correctly and write new conventions.
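
As a minimal sketch of the bookkeeping this implies, a reader or validator
might carry a list of deprecated keywords and report any that it meets. The
keyword list is taken from the text above; the helper name
\texttt{deprecated\_keywords} is hypothetical.
\begin{lstlisting}[language=Python,basicstyle=\ttfamily\footnotesize,breaklines=true]
# Keywords named as deprecated in the text above.
DEPRECATED = ("BLOCKED", "CROTA2", "EPOCH")


def deprecated_keywords(header):
    """Return the deprecated keywords present in `header`."""
    return [key for key in DEPRECATED if key in header]


header = {"SIMPLE": True, "BITPIX": -32, "EPOCH": 49740.82869315}
print(deprecated_keywords(header))
\end{lstlisting}
Every further deprecation lengthens such a list, and none of its entries can
ever be retired.
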
Although the FITS format is apparently rather simple on disk, the
multiple versions of the format description, and the existence of
numerous header conventions, mean that reading a FITS file in full
generality is a complicated and messy business. As there is no
versioning mechanism to effectively declare deprecated structures
finally ``illegal'', these complications and costs will only
increase.
%In consequence, it may be time to consider whether the
%(otherwise clear) benefits of the ``once FITS, forever FITS'' doctrine
%continue to outweigh the costs.
\subsection{Declaration and validation of content meaning}
\label{subsection_semantic_validation_and_declaration}
Related to, but separate from, the lack of versioning of the
serialization is the inability to declare the presence of data
models and their associated meaning. By `data model' we
mean:
\begin{quote}
``a description of the objects represented by a computer system
together with their properties and relationships; these are typically
`real world' objects such as products, suppliers, customers, and
orders.''\footnote{Definition adopted from Wikipedia, see
\url{http://en.wikipedia.org/wiki/Data\_model}}
\end{quote}
Of course, objects in astronomy are more likely to involve things like
observations, instruments, celestial coordinates and actual astronomical
objects such as stars. Likely properties one will encounter in a FITS
file include things like observational parameters (start/end times),
astronomical coordinates, name and properties of the observing
instrumentation, and so forth. In FITS-speak, we can say that any FITS
keyword outside those defined in the FITS standard is a data model
parameter, and collections of related FITS keywords form a data model.
Ideally a data model should be associated with a given, unique,
``name\-space'' so that collisions in naming of the models and requisite
parameters are avoided.
Data models can provide a standard by which information (data and metadata)
in the file may be semantically and syntactically validated in software.
Questions such as ``are all of the required metadata/data structures present
in the file?'' (e.g., all of
the needed keywords occur in the correct places in the file) and ``are
there any non-normative values in the file?'' (all metadata/data values
are within expected bounds) are both questions answered by syntactic
validation, the conformance of information in the file to one or more
declared data models. The question of ``how do these data (inter)relate
with other data'' (e.g., can named structures in the file be associated
in some manner with others in another file/extension?) is one of semantic
validation. By confirming that the file is `valid' in both senses, we may
link the data model to the information in the file, and hence answer the
fundamental question ``what does this data you gave me represent?'' (e.g., lists
of stars, tables of galaxies, images of dust clouds, etc).
It is important to note that all of these questions are critical to
consumers of the file.
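
As a minimal sketch of the syntactic checks described above, consider a
data model declared as a set of required keywords, each with an allowed
range of values. The model name \texttt{obs-core}, the keywords chosen and
their bounds are purely hypothetical, and the header is represented as a
plain Python dictionary rather than being read through any particular FITS
library.

\begin{lstlisting}[language=Python]
# Hypothetical data model: required keywords and value bounds.
# None of these names or limits come from the FITS standard.
MODEL = {
    "name": "obs-core",
    "required": {
        "EXPTIME": (0.0, 86400.0),  # seconds
        "AIRMASS": (1.0, 40.0),
    },
}


def validate(header, model):
    """Return a list of problems; an empty list means valid."""
    problems = []
    for key, (low, high) in model["required"].items():
        if key not in header:
            problems.append("missing " + key)
        elif not low <= header[key] <= high:
            problems.append("out of range " + key)
    return problems


header = {"EXPTIME": 30.0, "AIRMASS": 0.5}
print(validate(header, MODEL))
\end{lstlisting}

Such a check is only possible if the file declares which model it claims to
follow; without that declaration the software cannot know which set of
rules to apply.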
There is already ample evidence that the FITS community values and needs shared
data models. WCS is one such model, as are other FITS conventions
such as OIFITS
\citep{2006SPIE.6268E.106T}, MBFITS \citep{2006A&A...454L..25M},
PSRFITS \citep{2004PASA...21..302H},
SDFITS \citep{2000ASPC..216..243G} and FITS-IDI \citep{2011AIPS114}.
The declaration of keyword
dictionaries\footnote{Some collected data dictionaries with FITS
keywords may be seen at the GSFC FITS site, see
\url{http://fits.gsfc.nasa.gov/fits\_dictionary.html}} is also essentially
an act of declaring one or more data model(s).
Let us also note that it is not unreasonable to expect more than one model
to appear within a file. Consider data distributed by the Palomar Transient
Factory. For these data to permit the widest
variety of software tools to understand the astrometric distortion in these
images, keywords from both the ``SIP'' and ``TPV'' conventions are included
\citep{2012SPIE.8451E..1MS}.
One convention expresses distortion polynomials in pixel space and the
other in intermediate longitude and latitude, yet it is not immediately
obvious which data model should be applied.
All of these data models imply an associated ``namespace'' which is
a means of declaring the origin of the data model so that we may
disambiguate and/or associate declared properties between models.
For example, the two astrometric distortion models mentioned above should
each have a separate namespace.
There are common problems which name\-spaced models help to solve and even
the `simple' metadata in Fig.~\ref{fig:fitshead} illustrates this.
Consider the \texttt{TEMP} keyword in the example. Without reading the comment
associated with it, we cannot know whether this is a temperature,
some type of temporary file or resource, or something else. If it is a
temperature, then what is it the temperature of? What do the values `0'
and `1' mean? Are these the only valid values for this keyword? \texttt{TEMP} is
a keyword string likely to appear in other files; how do we know whether the
\texttt{TEMP} in the other files is the same one we see in the example?
Clearly, it is a non-trivial matter for the machine to determine whether
these are the same properties and to know other important details for using
this information. This problem is not isolated to a solitary bit
of rogue metadata. We can ask similar questions about most of the keywords
in the example header. Namespaced data models help address these issues. With
an appropriate namespace mechanism in place, it is possible to create a
machine-readable mapping between the data models so that any software program
can determine whether \texttt{model1:TEMP} denotes the same property as
\texttt{model2:TEMP}.
Namespacing mechanisms can both provide humans with documentation, and
provide software with the means to look up model definitions (perhaps from
remote locations), and thus apply syntactic or semantic validation rules
for the information at hand. This will allow the program to
answer the remainder of our posed questions above.
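
The following Python fragment is a minimal sketch of such a
machine-readable mapping. The registry contents, the namespace names
\texttt{model1} to \texttt{model3}, the \texttt{DETTEMP} keyword and the
concept identifiers are all hypothetical; in practice the registry would be
looked up from the declared namespace rather than hard-coded.

\begin{lstlisting}[language=Python]
# Hypothetical registry: each namespace maps a keyword to a
# concept identifier. All names here are illustrative only.
REGISTRY = {
    "model1": {"TEMP": "detector-temperature"},
    "model2": {"TEMP": "temporary-file-flag"},
    "model3": {"DETTEMP": "detector-temperature"},
}


def concept(qualified):
    """Resolve 'namespace:KEYWORD' to its declared concept."""
    namespace, keyword = qualified.split(":", 1)
    return REGISTRY[namespace][keyword]


def same_property(first, second):
    """True if two qualified keywords share one concept."""
    return concept(first) == concept(second)


print(same_property("model1:TEMP", "model2:TEMP"))
print(same_property("model1:TEMP", "model3:DETTEMP"))
\end{lstlisting}

In this sketch two keywords with the same spelling resolve to different
properties, while two keywords with different spellings resolve to the same
one, which is exactly the distinction that a bare keyword name cannot make.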
These arguments indicate that there is a pressing need for name\-spaced data models.
Yet the only ways in which we can currently implement them are for a human
to inspect the file, or to write special-purpose software targeted to
particular data models. Given the data volumes that we have in astronomy, the
latter choice is the direction in which we should go, but it is not practical
in the general case.
The writing of generalized software programs to detect any data models
present in a given FITS file is currently a difficult task for many reasons.
First and foremost, we must recognize that there are constantly new data models
being created and modified. Some of these are documented in a human readable
fashion but there are many more models which do not even meet this standard.
Worse, due in part to the lack of good validation tools, the community has
accepted many informal variants of existing models. These variants may
or may not be documented, but they result from either accidental or intentional
stretching of the original metadata usage. The header in
Fig.~\ref{fig:fitshead}, for example, is an informal variant because of its non-standard
use of the \texttt{EPOCH} keyword.
Finally, there is the possible complication of more than one data model being
fully, or partially, present within a file. Without explicit signposts for
the software to use, it is likely impossible to determine which data models
are present and map information to appropriate meaning.
\section{Data models}
\label{section_crit_data_models}
One of FITS's strengths is that it includes some common data structures
which are important in astronomy data. The FITS standard includes such
things as ``table'' and ``n-dimensional array'', the latter of which is
used to model both images and data cubes. These items are really simple
representations of the data at a primitive level, and are certainly
needed for basic access to the information within the file. Even so
these structures, by themselves, do not contain much in the way of necessary
detail and semantic information which tells the consumer exactly what
it is they are actually consuming. For this reason, they cannot be considered to
be data models.
The FITS standard does supply a data model (the aforementioned
WCS, for example, may be considered to be part of it), and these standard semantics are generally
regarded as another strong point of FITS. The other data formats we have
previously mentioned vary in how extensive their core data model is. The range
goes from HDF5, which does not supply any data model \emph{per se}, to that of
NDF which has very rich metadata in its data model.
It is a matter of opinion whether richer, more detailed data models in the
format standard are better or not. The NDF core data model metadata are certainly
more detailed than the metadata in the FITS standard. On the other hand, FITS is
certainly more widely adopted than NDF. Nevertheless, we believe FITS would benefit
from an expansion of its standard data model, as there are common semantics
to be found in other data formats (e.g., NDF, XDF) and in FITS-based model
extensions (e.g., MBFITS or local data dictionaries) from which the community could
benefit.
In this section we detail some important missing (component) data models.
\subsection{Scientific Errors}
The measurement of physical properties with their associated uncertainties
is fundamental to astronomical
research. It is thus ironic that FITS, which is purposely designed for
supporting astronomical research, has no standard data model for
capturing information about scientific errors.
We could easily list a great number of possible error types which
might be useful but trying to encompass all of the needs of the
community at once is likely to create an unwieldy data model. We
suggest that the community needs to provide for the most common needs,
and target that subset as a first, shared model. Earlier efforts which
might inform and help this work include local data models at sites
such as CADC \citep{2012ASPC..461..339D} and the error
models implemented in other data formats like NDF
\citep[although see for example][]{1991STARB...8...19M}, and software
efforts underway in scientific programming communities such as Astropy
\citep{2013A&A...558A..33A}. Each of these has valuable insight into
the requirements.
Nevertheless, we can anticipate that the following general
characteristics might be part of the model:
\begin{itemize}
\item Allow for both metadata and data to have errors.
\item Allow for extensible classification of the error type. For example,
``Gaussian'' errors are also a subclass of ``statistical'' errors.
\item Allow association of more than one error class/type per
measurement. For example, allow for both systematic and statistical
errors to be associated with each measurement.
\item Allow for additional properties to be associated with each error
class. For example, ``statistical'' errors may have an assigned ``sigma''
value.
\end{itemize}
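
To make these characteristics concrete, the following Python fragment is
a minimal sketch of one possible structure. The class and field names are
our own invention for illustration and do not correspond to any existing
FITS convention, to NDF, or to Astropy.

\begin{lstlisting}[language=Python]
from dataclasses import dataclass, field


@dataclass
class ErrorTerm:
    """One error attached to a value (hypothetical model)."""
    kind: tuple   # class path, most general class first
    value: float
    properties: dict = field(default_factory=dict)

    def is_a(self, name):
        """True if this error belongs to the named class."""
        return name in self.kind


@dataclass
class Measurement:
    """A datum or metadata value with any number of errors."""
    name: str
    value: float
    errors: list = field(default_factory=list)


flux = Measurement("FLUX", 12.3, [
    ErrorTerm(("statistical", "Gaussian"), 0.4, {"sigma": 1}),
    ErrorTerm(("systematic",), 0.1),
])
stat = [e for e in flux.errors if e.is_a("statistical")]
print(len(stat), stat[0].properties["sigma"])
\end{lstlisting}

Here the class path records that a ``Gaussian'' error is also a
``statistical'' error, a measurement carries both a statistical and a
systematic term, and the ``sigma'' level is held as a property of the error
class rather than of the measurement.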
\subsection{Extended Coordinate Support}
\label{sec:wcs}
The existing FITS WCS data models illustrate some of the limitations
associated with FITS. The ``once FITS, always FITS'' idea required that
the current WCS standards were developed as an extension of the older
AIPS standard, and so inherited many of the inherent limitations of
that system. Even so, they took a long time to be agreed. They are
complex yet incomplete and inflexible. They are inadequate for many
modern telescopes, and restrict creative use of novel coordinate
transformations in subsequent data analysis. For instance, raw data
must handle more distortion issues than the FITS WCS standard
projections can handle. There are some provisions for handling more
arbitrary distortions, but they are either cumbersome or too
simple. Perhaps the biggest limitation is that different
transformations of coordinates cannot be combined in flexible
ways. The user is effectively limited to choosing only one of the solutions
available.
This is unfortunate. Not only does it reduce the range of
transformations that can be described, but it also makes it harder to
decompose the total transformation into its component parts thus making
understanding and manipulation of the total transformation harder. The
alternative approach -- a ``toolkit''-style system that creates complex
transformations by stacking simpler atomic mappings -- is usually the
most efficient representation as far as data storage is concerned
(for example AST, see below).
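
A minimal Python sketch of this toolkit style is given below. The class
names are invented for illustration and are not the AST interface; the
point is only that a complex transformation is stored as a short chain of
simple, individually inspectable mappings.

\begin{lstlisting}[language=Python]
import math


class Mapping:
    """Base class for an atomic coordinate mapping."""

    def forward(self, point):
        raise NotImplementedError

    def then(self, other):
        """Return the series combination self -> other."""
        return Series(self, other)


class Shift(Mapping):
    def __init__(self, dx, dy):
        self.dx, self.dy = dx, dy

    def forward(self, point):
        x, y = point
        return (x + self.dx, y + self.dy)


class Scale(Mapping):
    def __init__(self, factor):
        self.factor = factor

    def forward(self, point):
        x, y = point
        return (x * self.factor, y * self.factor)


class Polar(Mapping):
    """Cartesian (x, y) to polar (radius, angle in radians)."""

    def forward(self, point):
        x, y = point
        return (math.hypot(x, y), math.atan2(y, x))


class Series(Mapping):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def forward(self, point):
        return self.second.forward(self.first.forward(point))


# Recentre on pixel (10, 10), apply a scale, then go polar.
wcs = Shift(-10.0, -10.0).then(Scale(0.5)).then(Polar())
print(wcs.forward((16.0, 18.0)))
\end{lstlisting}

Each component can be replaced or removed without touching the others, and
a further distortion step could be inserted into the chain with another
call to \texttt{then}.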
To illustrate the problem consider the imaging data taken by the
Hubble Space Telescope which require multiple distortion components \citep[see e.g.,][]{2013ASPC..475...49H}.
Some are small but discontinuous. Others are linear but time varying.
There is no FITS WCS compatible solution that handles these needs well.
As another example, SCUBA-2 raw data \citep[see
e.g.,][]{2013MNRAS.430.2513H} include focal plane distortions which are
combined with other transformations but must also support the dynamic
insertion of other distortion models when a Fourier transform
spectrometer \citep{2010SPIE.7741E..67G} is placed in the beam.
Another case with poor support is Integral Field Unit (IFU) data.
Many of these datasets
have discontinuous WCS models. The only way to support these in FITS
now is to explicitly map each pixel to the world coordinates. Besides
being space-inefficient, such a mapping is difficult to manipulate in any simple
way.
In addition to limiting the description of raw telescope data, FITS
WCS also restricts what can be done with such data during subsequent
analysis. There are many potentially interesting transformations that
would result in the final WCS being inexpressible using the
restrictive FITS model. For instance, transforming an image of an
elliptical galaxy into polar or elliptical coordinates is currently not possible. Another
unworkable case is attaching an alternate coordinate system to an image to
represent the pixel coordinates of a second image covering the same
part of the sky. These may not be common requirements, but they
illustrate the wide range of transformations that should be possible
with a flexible WCS system.
The inflexibility in the FITS solution arises from multiple issues,
but lack of namespaces is a serious barrier to providing a more
flexible solution. If one has multiple model components each with
similar parameters, how does one distinguish between them? One may use
the letter suffix, but that is also used to distinguish between
alternate WCS models. The limit on keyword length restricts
how many coefficients can be supported. The lack of any
explicit grouping mechanism requires complex conventions on how to
relate whole sets of keywords. With more modern structures, such
contortions and limitations are not necessary.
The reality is that to solve these problems, many software systems
have chosen alternate solutions and save their WCS information in FITS
files in other ways (or in separate files). For example, the AST
library \citep{1998ASPC..145...41W,2012ASPC..461..825B} is not subject
to these limitations, but
is forced to use non-standard FITS keywords when serializing mappings
to FITS files (see Fig.~\ref{fig:asthead}).
\begin{figure*}
\begin{minipage}{\textwidth}
\begin{lstlisting}
PLRLG_A = 5.50788096462284 / Polar longitude (rad.s)
ENDAST_K= 'SphMap ' / End of object definition
MAPB_A = ' ' / Second component Mapping
BEGAST_O= 'CmpMap ' / Compound Mapping
NIN_D = 3 / Number of input coordinates
NOUT_B = 2 / Number of output coordinates
INVERT_C= 0 / Mapping not inverted
ISA_I = 'Mapping ' / Mapping between coordinate systems
INVA_B = 1 / First Mapping used in inverse direction
MAPA_C = ' ' / First component Mapping
BEGAST_P= 'MatrixMap' / Matrix transformation
NIN_E = 3 / Number of input coordinates
INVERT_D= 1 / Mapping inverted
ISA_J = 'Mapping ' / Mapping between coordinate systems
M0_A = 0.426766777415161 / Forward matrix value
M1_A = 0.699933471661958 / Forward matrix value
M2_A = 0.572680760059142 / Forward matrix value
M3_A = -0.418237169285184 / Forward matrix value
\end{lstlisting}
\caption{Example representation of an AST WCS object in a
FITS header, as written when the mapping is too complex to be represented using the
FITS-WCS standard.}
\label{fig:asthead}
\end{minipage}
\end{figure*}
\subsection{History and Provenance}
\label{sec:history}
The FITS standard encourages people to store processing history
information in the header using a pseudo-comment field named
\texttt{HISTORY}. This works from the perspective of making the information
available to a sufficiently interested human (assuming that each step in the
data processing adds information to the end of the history section of the
header) but the free-form nature of the entries makes it essentially
impossible for a software system to understand what was done to the data.
This may be possible within the constraints of a single data reduction
environment but it is highly unlikely that the content of the
\texttt{HISTORY} block can be understood by any other software packages. History
needs to be treated as a first-class citizen with a standardized way of
registering important information such as the date, the software tool and
any relevant arguments or switches.
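
As a minimal sketch of what a standardized history entry might contain,
the following Python fragment records the date, the software tool and its
arguments as structured fields and serializes them as JSON. The field
names, the tool name \texttt{flatfield} and the choice of JSON are
assumptions made for illustration only.

\begin{lstlisting}[language=Python]
import json
from dataclasses import dataclass, asdict


@dataclass
class HistoryRecord:
    """One processing step (hypothetical field names)."""
    date: str        # ISO 8601 timestamp of the step
    tool: str        # name of the software tool
    arguments: dict  # switches and parameters used


def append_step(history, record):
    """Return a new history list with the record appended."""
    return history + [asdict(record)]


history = append_step([], HistoryRecord(
    date="2014-01-01T00:00:00Z",
    tool="flatfield",
    arguments={"flat": "flat01.fits", "clip": 3.0},
))
text = json.dumps(history, sort_keys=True)
print(json.loads(text)[0]["tool"])
\end{lstlisting}

Because every entry has the same named fields, a different software package
can recover which tool was run, and with which arguments, without parsing
free-form text.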
A related issue is data provenance; that is, sufficient records of how files
were created to permit their reproduction. For a given processed data product
it is, for example, impossible to determine which data files
contributed to the creation of that product. While there is no metadata standard
for specifying this
information in output files, experimental systems exist which, when mature,
aim to offer programmatic interfaces that will
simplify recording provenance information. One such example is
Provenance Aware Service Oriented Architecture
\citep[PASOA;][]{2008IPAWMoreau,2011743Moreau}, an open source architecture
already used in fields such as aerospace engineering. In brief, when
applications are executed they produce documentation of the process recorded
in a repository of provenance records that are persisted in a database. In
astronomy, PASOA was successfully demonstrated by integrating it into the
Pegasus workflow management system for running the Montage mosaic engine
\citep{2009SCGroth}.
At the JCMT Science Archive \citep[JSA;][]{2008ASPC..394..135G,2015Economou} data are
created with full provenance information using the native provenance
tracking that is part of NDF \citep{2009ASPC..411..418J}. This
provenance includes every ancestor along with history information that
contributed to each ancestor. When these files are converted to FITS
for ingestion into the JSA using the CAOM-2 data model
\citep{2013ASPC..475..159R} the provenance is trimmed to include
just the immediate parent files (using \texttt{PRVnnnnn} headers)
and the observation identifiers of the root ancestor observations
(using \texttt{OBSnnnnn} headers). The full richness of the provenance
information is available in FITS binary tables but the lack of a standard
leaves this information hidden from applications other than the ones
that created it originally.
Finally, astronomy may benefit from methodologies used to develop provenance
systems tailored to Earth Science and remote sensing
\citep{2008IPAWTilmes,2008IPAWMcCann}.
\subsection{Data Quality}
One of the more pressing needs in our era of shared and distributed
data is the need to know which data are ``good'' or, to put it another
way, of sufficient quality. We are long past the era when the data
volume was so small that it was practical to download all of the possible
data of interest and examine them locally.
Some might insist that this is an easily solved problem. Simply
declare a keyword, like \texttt{DQUALITY}, and allow it to take a boolean
value. To be sure, that example is an exaggeration, but it helps to
illustrate that there is no single optimum between the virtue of
simplicity and the vice of being simplistic.
Data quality cannot be judged on a single parameter, or even a
small set of parameters.
Data which are adequate for one type of use may be wholly inadequate
in another usage context. Consider that engineering data generally are
unsuitable for science and vice versa. Science data may be
unsuitable for other types of science (for example, studies of sky
background vs.\ pointed source science).
A data quality model then, should be an ensemble of common statistical
measures of the type of dataset which may be used to derive
higher-level judgments of the quality/suitability of the data for
some other declared purpose. There are many higher-level data
quality models which will need to be created from the lower-level
measures (image data quality, pointed catalog data quality, etc.) and
from these particular, targeted, statistical measures data quality may
be judged by the dataset consumer without directly examining the data
themselves.
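
The separation between low-level measures and purpose-specific judgments
can be sketched in a few lines of Python. The two measures, the purpose
names and the thresholds below are hypothetical and chosen only to show
that the same dataset can be suitable for one use and unsuitable for
another.

\begin{lstlisting}[language=Python]
import statistics


def measures(pixels):
    """Low-level statistical measures of an image."""
    return {
        "median": statistics.median(pixels),
        "stdev": statistics.pstdev(pixels),
    }


# Hypothetical purpose-specific rules built on the measures.
PURPOSES = {
    "sky-background": lambda m: m["stdev"] < 2.0,
    "point-source": lambda m: m["median"] < 100.0,
}


def suitable(pixels, purpose):
    """Judge suitability for a declared purpose."""
    return PURPOSES[purpose](measures(pixels))


pixels = [10.0, 12.0, 11.0, 50.0, 9.0]
print(suitable(pixels, "sky-background"))
print(suitable(pixels, "point-source"))
\end{lstlisting}

If the measures were stored with the data, a consumer could apply their
own rules for a new purpose without downloading the pixels at all.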
\subsection{Units}
\label{sec_units}
A strength of FITS is that it includes support for units within its core
standard. There are, however, limitations in the utility of the provided
specification.