\chapter{Data~analysis}
%Peter Hristov is the editor of this chapter.
\label{CH:Data_Analysis}
\section{Introduction}
The analysis of experimental data is the final stage of event
processing and it is usually repeated many times. Analysis is a very diverse
activity, where the goals of each
particular analysis pass may differ significantly.

The ALICE detector is optimized for the reconstruction and analysis of 
heavy-ion collisions. 
In addition, ALICE has a broad physics programme devoted to
\mbox{pp} and \mbox{pA} interactions. 


The main points of the ALICE heavy-ion programme can 
be summarized as follows\cite{CH6Ref:PPR}:

\begin{itemize}
\item {\bf global event characteristics:}  particle multiplicity,
centrality, energy density, nuclear stopping;

\item {\bf soft physics:} chemical composition (particle and resonance
  production, particle ratios and spectra, strangeness enhancement),
  reaction dynamics (transverse and elliptic flow, HBT correlations,
  event-by-event dynamical fluctuations);

\item {\bf hard probes:} jets, direct photons;

\item {\bf heavy flavours:} quarkonia, open charm and beauty production.

\end{itemize}

The \mbox{pp} and \mbox{pA} programme will provide, on the one hand, reference points
for comparison with heavy ions. On the other hand, ALICE will also
pursue genuine and detailed 
\mbox{pp} studies. Some
quantities, in particular the global characteristics of interactions, will
be measured during the first days of running exploiting the low-momentum
 measurement and particle identification capabilities of ALICE. 

\subsection{The analysis activity}
\label{SEC:The_Analysis_Activity}

We distinguish two main types of analysis: scheduled analysis and
chaotic analysis. They differ in their data access pattern, in the
storage and registration of the results, and in the frequency of
changes in the analysis code. The detailed definition is given in
Section~\ref{SEC:Organization_of_the_data_analysis}.

In the ALICE Computing Model the analysis starts from the Event Summary
Data (ESD). These are produced during the reconstruction step and contain
all the information for the analysis. The size of the ESD is
about one order of magnitude smaller than that of the corresponding raw
data.  The analysis tasks produce Analysis
Object Data (AOD) specific to a given set of physics objectives. 
Further passes for the specific analysis activity can be performed on
the AODs, until the selection parameters or algorithms are changed.

A typical data analysis task usually requires processing of
selected sets of events. The selection is based on the event
topology and characteristics, and is done by querying the tag
database (see Chapter~\ref{CH:Overview_of_aliroot_framework}).  The
tags represent physics quantities which characterize 
each run and event, and permit fast selection. They are created
after the reconstruction and also contain the unique
identifier of the ESD file. A typical query, when translated into
natural language, could look like ``Give me  
all the events with impact parameter in $<$range$>$
containing jet candidates with energy larger than $<$threshold$>$''.
This results in a list of events and file identifiers to be used in the
subsequent event loop.
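
The tag-based selection can be pictured with a small stand-alone C++
sketch. The structure {\tt EventTag}, its fields, and the function
{\tt SelectEvents} are hypothetical names introduced here for
illustration only; they do not reproduce the actual tag-database schema
or its query interface.

\begin{verbatim}
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-event tag: a few physics quantities plus the
// identifier of the ESD file that holds the event.
struct EventTag {
  std::string esdFileId;   // unique identifier of the ESD file
  int eventNumber;         // index of the event inside that file
  double impactParameter;  // fm
  double maxJetEnergy;     // GeV, leading jet candidate (0 if none)
};

// Keep the events with bMin <= b < bMax that contain a jet candidate
// with energy above eMin.
std::vector<EventTag> SelectEvents(const std::vector<EventTag>& tags,
                                   double bMin, double bMax,
                                   double eMin) {
  std::vector<EventTag> selected;
  for (const EventTag& t : tags) {
    if (t.impactParameter >= bMin && t.impactParameter < bMax &&
        t.maxJetEnergy > eMin) {
      selected.push_back(t);
    }
  }
  return selected;
}
\end{verbatim}

The returned list carries the file identifier and event number of each
selected event, which is the information the event loop needs in order
to open the right ESD files.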
    
The next step of a typical analysis consists of a loop over all the events
in the list and calculation of the physics quantities of
interest. Usually, for each event, there is a set of embedded loops on the
reconstructed entities such as tracks, ${\rm V^0}$ candidates, neutral
clusters, etc., the main goal of which is to select the signal
candidates. Inside each loop a number of criteria (cuts) are applied to
reject the background combinations and to select the signal ones. The
cuts can be based on geometrical quantities such as impact parameters
of the tracks with 
respect to the primary vertex, distance between the cluster and the
closest track, distance of closest approach between the tracks,
angle between the momentum vector of the particle combination
and the line connecting the production and decay vertices. They can
also be based on kinematic quantities such as momentum ratios, minimal
and maximal transverse momentum, and angles in the rest frame of the
combination.
Particle identification criteria are also among the most common
selection criteria.
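
As an illustration of one such geometrical cut, the pointing angle
$\theta_{\rm p}$ can be written in terms of the momentum $\vec{p}$ of
the particle combination and the positions $\vec{r}_{\rm prod}$ and
$\vec{r}_{\rm dec}$ of the production and decay vertices (these symbols
are introduced here for illustration only):
\[
\cos\theta_{\rm p} =
\frac{\vec{p}\cdot\left(\vec{r}_{\rm dec}-\vec{r}_{\rm prod}\right)}
     {\left|\vec{p}\right|\,\left|\vec{r}_{\rm dec}-\vec{r}_{\rm prod}\right|} .
\]
A combination that really originates from the production vertex has
$\cos\theta_{\rm p}$ close to one, so a cut of this kind typically keeps
the candidates with $\cos\theta_{\rm p}$ above a chosen threshold.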

The optimization of the selection criteria is one of the most
important parts of the analysis. The goal is to maximize the
signal-to-background ratio in case of search tasks, or another 
ratio (typically ${\rm Signal/\sqrt{Signal+Background}}$) in
case of measurement of a given property.  Usually, this optimization is
performed using simulated events where the information from the
particle generator is available. 
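
A small numerical illustration, with invented numbers, shows why the two
figures of merit can favour different cuts. Writing $S$ and $B$ for the
numbers of signal and background candidates that pass the selection,
\[
R = \frac{S}{B}, \qquad \Sigma = \frac{S}{\sqrt{S+B}} .
\]
A loose selection with $S=100$ and $B=300$ gives $R \approx 0.33$ and
$\Sigma = 100/\sqrt{400} = 5$. A tighter selection with $S=60$ and
$B=40$ gives $R = 1.5$ and $\Sigma = 60/\sqrt{100} = 6$. A still tighter
one with $S=20$ and $B=5$ raises $R$ to $4$ but lowers $\Sigma$ to
$20/\sqrt{25} = 4$. In this example the signal-to-background ratio keeps
improving as the cuts are tightened, whereas the significance is largest
for the intermediate selection.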

After the optimization of the selection criteria, one has to take into
account the combined acceptance of the detector.  This is a complex,
analysis-specific quantity which depends on the geometrical acceptance,
the trigger efficiency, the decays of particles, the reconstruction
efficiency, the efficiency of the particle identification and of the
selection cuts. The components of the combined acceptance are usually
parametrized and their product is used to unfold the experimental
distributions or during the simulation of some model parameters. 
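
Schematically, and with symbols that are introduced here only for
illustration, the combined acceptance can be written as the product
\[
\varepsilon_{\rm comb} =
A_{\rm geom}\,\varepsilon_{\rm trig}\,f_{\rm decay}\,
\varepsilon_{\rm rec}\,\varepsilon_{\rm PID}\,\varepsilon_{\rm cuts} ,
\]
where each factor is in general a function of kinematic variables such
as $p_{\rm t}$ and rapidity $y$. In the simplest bin-by-bin case the
corrected yield would follow from the raw one as
\[
N_{\rm corr}(p_{\rm t},y) =
\frac{N_{\rm raw}(p_{\rm t},y)}{\varepsilon_{\rm comb}(p_{\rm t},y)} .
\]
The factorized form is a sketch: it assumes that each factor is
evaluated for the candidates that survive the preceding steps, and a
real analysis may need a full unfolding instead of a bin-by-bin
division.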

 The last part of the analysis usually involves quite complex
 mathematical treatments and sophisticated statistical tools. Here one
 may include the correction for systematic effects, the estimation of
 statistical and systematic errors, etc.

%------------------------------------------------------------------------------

\section{Organization of the data analysis}
\label{SEC:Organization_of_the_data_analysis}

The data analysis is coordinated by the Physics Board via the Physics
Working Groups (PWGs). At present the following PWGs have started
their activity: 
\begin{itemize}

\item detector performance;

\item global event characteristics and soft physics (including
  proton--proton physics);

\item hard probes: jets and direct photons;

\item heavy flavours.

\end{itemize}

\noindent
{\bf Scheduled analysis}
\\

\noindent
The scheduled analysis typically uses all
the available data from a given period, and stores and registers the results
using \grid middleware. The tag database is updated accordingly. The
AOD files, generated during the scheduled 
analysis, can be used by several subsequent analyses, or by a class of
related physics tasks. 
The procedure of scheduled analysis is centralized and can be
considered as data filtering. The requirements come from the PWGs and
are prioritized by the Physics Board taking into 
account the available computing and storage resources. The analysis
code is tested in advance and released before the beginning of the
data processing.

Each PWG will require several sets of
AOD per event, which are specific for one or
a few analysis tasks. The creation of the AOD sets is managed centrally.
The event list of each AOD set
will be registered and the access to the AOD files will be granted to
all ALICE collaborators.  AOD files will be generated 
via \grid tools at different computing centres and will be stored on
the corresponding storage 
elements.  The processing of each file set will thus be done in a
distributed way on the \grid. Some of the AOD sets may be quite small
and would fit on a single storage element or even on one computer; in
this case the corresponding tools for file replication, available
in the ALICE \grid infrastructure, will be used.
\\

\noindent
{\bf Chaotic analysis}
\\

\noindent
The chaotic analysis is focused on a single physics task and
typically is based on the filtered data from the scheduled
analysis. Each physicist may also
directly access large parts of the ESD in order to search for rare
events or processes.
Usually the user develops the code using a small subsample
of data, and changes the algorithms and criteria frequently. The
analysis macros and software are tested many times on relatively
small data volumes, both experimental and \MC.
The output is often only a set of histograms. 
Such a tuning of the analysis code can be done on a local
data set or on distributed data using \grid tools. The final version
of the analysis 
will eventually be submitted to the \grid and will access large
portions or even 
the totality of the ESDs. The results may be registered in the \grid file
catalogue and used at later stages of the analysis. 
This activity may or may not be coordinated inside
the PWGs, via the definition of priorities. The
chaotic analysis is carried out within the computing resources of the
physics groups.


%------------------------------------------------------------------------------

\section{Infrastructure tools for distributed analysis}

\subsection{gShell}

The main infrastructure tools for distributed analysis have been
described in Chapter 3. The actual middleware is hidden by an
interface to the \grid, gShell\cite{CH6Ref:gShell}, which provides a
single working shell.  
The gShell package contains all the commands a user may need for file
catalogue queries, creation of sub-directories in the user space,
registration and removal of files, job submission and process
monitoring. The actual \grid middleware  is completely transparent to
the user.

The gShell overcomes the scalability problem of direct client
connections to databases. All clients connect to the
gLite\cite{CH6Ref:gLite} API 
services. This service is implemented as a pool of preforked server
daemons, which serve single-client requests. The client-server
protocol implements a client state which is represented by a current
working directory, a client session ID and time-dependent symmetric
cipher on both ends to guarantee client privacy and security. The
server daemons execute client calls with the identity of the connected
client. 

\subsection{PROOF -- the Parallel ROOT Facility}

The Parallel ROOT Facility, PROOF\cite{CH6Ref:PROOF} has been specially
designed and developed 
to allow the analysis and mining of very large data sets, minimizing
response time. It makes use of the inherent parallelism in event data
and implements an architecture that optimizes I/O and CPU utilization
in heterogeneous clusters with distributed storage. The system
provides transparent and interactive access to terabyte-scale data
sets. Being part of the ROOT framework, PROOF inherits the benefits of
a high-performance object storage system and a wealth of statistical and
visualization tools. 
The most important design features of PROOF are:

\begin{itemize}
\item transparency -- no difference between a local ROOT and
  a remote parallel PROOF session; 
\item scalability -- no implicit limitations on number of computers
  used in parallel;
\item adaptability -- the system is able to adapt to variations in the
  remote environment.
\end{itemize}

PROOF is based on a multi-tier architecture: the ROOT client session,
the PROOF master server, optionally a number of PROOF sub-master
servers, and the PROOF worker servers. The user connects from the ROOT
session to a master server on a remote cluster and the master server
creates sub-masters and worker servers on all the nodes in the
cluster. All workers process queries in parallel and the results are
presented to the user as coming from a single server.
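
The split-and-merge idea behind such a query can be sketched in
stand-alone C++. This is only an illustration under strong assumptions:
the names {\tt ProcessPacket} and {\tt RunQuery} are hypothetical, the
`workers' are called one after another instead of running in parallel
on remote nodes, and no attempt is made to reproduce the way PROOF
actually distributes work.

\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <vector>

using Histogram = std::vector<long>;  // bin contents

// Work done by one worker: histogram the values in [first, last).
// Values outside [0, nBins) are ignored.
Histogram ProcessPacket(const std::vector<int>& values,
                        std::size_t first, std::size_t last,
                        std::size_t nBins) {
  Histogram h(nBins, 0);
  for (std::size_t i = first; i < last; ++i) {
    const int v = values[i];
    if (v >= 0 && static_cast<std::size_t>(v) < nBins) ++h[v];
  }
  return h;
}

// Work done by the master: split the entries into one packet per
// worker and add up the partial histograms that come back.
Histogram RunQuery(const std::vector<int>& values,
                   std::size_t nWorkers, std::size_t nBins) {
  Histogram merged(nBins, 0);
  if (nWorkers == 0) return merged;
  const std::size_t n = values.size();
  for (std::size_t w = 0; w < nWorkers; ++w) {
    const std::size_t first = n * w / nWorkers;
    const std::size_t last = n * (w + 1) / nWorkers;
    const Histogram part = ProcessPacket(values, first, last, nBins);
    for (std::size_t b = 0; b < nBins; ++b) merged[b] += part[b];
  }
  return merged;
}
\end{verbatim}

Because the partial histograms are simply added, the merged result does
not depend on the number of workers, which is what allows the output to
be presented as if it came from a single server.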

PROOF can be run either in a purely interactive way, with the user
remaining connected to the master and worker servers and the analysis
results being returned to the user's ROOT session for further
analysis, or in an `interactive batch' way where the user disconnects
from the master and workers (see Fig.~\vref{CH3Fig:alienfig7}). By
reconnecting later to the master server the user can retrieve the
analysis results for that particular 
query. This last mode is useful for relatively long running queries
(several hours) or for submitting many queries at the same time. Both
modes will be important for the analysis of ALICE data.

%\begin{figure}[htb]
%\centering
%\includegraphics*[width=130mm]{chap6fig/proof.eps}
%\caption{Main PROOF components and their interactions.}
%\label{CH6Fig:PROOF}
%\end{figure}

%------------------------------------------------------------------------------

\section{Analysis tools}

This section is devoted to the existing analysis tools in \ROOT and
\aliroot. As discussed in the introduction, some very broad
analysis tasks include the search for some rare events (in this case the
physicist tries to maximize the signal-over-background ratio), or
measurements where it is important to maximize the signal
significance. The tools that provide possibilities to apply certain
selection criteria and to find the interesting combinations within
a given event are described below. Some of them are very general and are
used in many different places, for example the statistical
tools. Others are specific to a given analysis.

\subsection{Statistical tools}

Several commonly used statistical tools are available in
\ROOT\cite{CH6Ref:ROOT}. \ROOT provides 
classes for efficient data storage and access, such as trees
and ntuples. The
ESD information is organized in a tree, where each event is a separate
entry. This allows the ESD files to be chained and the
selector mechanism to be used in order to exploit the PROOF
services. Inside each 
ESD object the data is stored in polymorphic containers filled with
reconstructed tracks, neutral particles, etc. The tree classes
permit easy navigation, selection, browsing, and visualization of the
data in the branches. 

\ROOT also provides histogramming and fitting classes, which are used 
for the representation of all the one- and multi-dimensional
distributions, and for extraction of their fitted parameters. \ROOT provides
an interface to powerful and robust minimization packages, which can be 
used directly during some special parts of the analysis. A special
fitting class allows one to decompose an experimental histogram as a
superposition of source histograms.

\ROOT also has a set of sophisticated statistical analysis tools such as
principal component analysis, robust estimators, and neural networks.
 The calculation of confidence levels is provided as well.

Additional statistical functions are included in {\tt TMath}.

\subsection{Calculations of kinematic variables}

The main \ROOT physics classes include 3-vectors and Lorentz
vectors, and operations
such as translation, rotation, and boost. The calculations of
kinematic variables 
such as transverse and longitudinal momentum, rapidity,
pseudorapidity, effective mass, and many others are provided as well.
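
For reference, the standard definitions of these variables, in natural
units ($c=1$) and with the $z$ axis along the beam direction, are
\[
p_{\rm t} = \sqrt{p_x^2 + p_y^2}, \qquad
y = \frac{1}{2}\ln\frac{E+p_z}{E-p_z}, \qquad
\eta = -\ln\tan\frac{\theta}{2},
\]
where $\theta$ is the polar angle of the momentum, and the effective
(invariant) mass of a combination of particles $i$ is
\[
M^2 = \Bigl(\sum_i E_i\Bigr)^2 - \Bigl|\sum_i \vec{p}_i\Bigr|^2 .
\]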


\subsection{Geometrical calculations}

There are several classes which can be used for
measurement of the primary vertex: {\tt AliITSVertexerZ},
{\tt AliITSVertexerPPZ}, {\tt AliITSVertexerIons},
{\tt AliITSVertexerTracks}. A fast estimation of the {\it z}-position can be
done by {\tt AliITSVertexerZ}, which works for both lead--lead
and proton--proton collisions. A universal tool is provided by
{\tt AliITSVertexerTracks}, which calculates the position and
covariance matrix of the primary vertex based on a set of tracks, and
also estimates the $\chi^2$ contribution of each track. An iterative
procedure can be used to remove the secondary tracks and improve the
precision. 

Track propagation to the primary vertex (inward) is provided in
{\tt AliESDtrack}.

The secondary vertex reconstruction in case of ${\rm V^0}$ is provided by
{\tt AliV0vertexer}, and in case of cascade hyperons by
{\tt AliCascadeVertexer}.  A universal tool is
{\tt AliITSVertexerTracks}, which can also be used to find secondary
vertices close to the primary one, for example decays of open charm
like ${\rm D^0 \to K^- \pi^+}$ or ${\rm D^+ \to K^- \pi^+ \pi^+}$. All
the vertex 
reconstruction classes also calculate the distance of closest approach (DCA)
between the track and the vertex.

The calculation of impact parameters with respect to the primary vertex
is done during the reconstruction and the information is available in
{\tt AliESDtrack}. It is then possible to recalculate the
impact parameter during the ESD analysis, after an improved determination
of the primary vertex position using reconstructed ESD tracks.

\subsection{Global event characteristics}

The impact parameter of the interaction and the number of participants
are estimated from the energy measurements in the ZDC. In addition,
the information  from the FMD, PMD, and T0 detectors is available. It
gives a valuable estimate of the event multiplicity at high rapidities
and permits global event characterization. Together with the ZDC
information it improves the determination of the impact parameter,
number of participants, and number of binary collisions.

The event plane orientation is calculated by the {\tt AliFlowAnalysis} class.

\subsection{Comparison between reconstructed and simulated parameters}

The comparison between the reconstructed and simulated parameters is
an important part of the analysis. It is the only way to estimate the
precision of the reconstruction. Several example macros exist in
\aliroot and can be used for this purpose: {\tt AliTPCComparison.C},
{\tt AliITSComparisonV2.C}, etc. As a first step in each of these
macros the list of so-called `good tracks' is built. The definition of
a good track is explained in detail in the ITS\cite{CH6Ref:ITS_TDR} and 
TPC\cite{CH6Ref:TPC_TDR} Technical Design
Reports.  The essential point is that the track
goes through the detector and can be reconstructed. Using the `good
tracks' one then estimates the efficiency of the reconstruction and
the resolution.
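
As a sketch of what these two quantities mean (the symbols and the
error estimate are introduced here for illustration and are not taken
from the macros themselves), the efficiency and its binomial
uncertainty can be written as
\[
\varepsilon = \frac{N_{\rm found}}{N_{\rm good}}, \qquad
\sigma_{\varepsilon} =
\sqrt{\frac{\varepsilon\,(1-\varepsilon)}{N_{\rm good}}} ,
\]
where $N_{\rm good}$ is the number of `good tracks' and
$N_{\rm found}$ the number of them that are matched to a reconstructed
track. The resolution on a parameter such as the transverse momentum is
then estimated from the width of the distribution of
\[
\delta = \frac{p_{\rm t}^{\rm rec} - p_{\rm t}^{\rm sim}}
              {p_{\rm t}^{\rm sim}}
\]
over the matched tracks.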

Another example is specific to the MUON arm: the {\tt MUONRecoCheck.C}
macro compares the reconstructed muon tracks with the simulated ones.

There is also the possibility to calculate directly the resolutions without
additional requirements on the initial track. One can use the
so-called track label and retrieve the corresponding simulated
particle directly from the particle stack ({\tt AliStack}).

\subsection{Event mixing}

One particular analysis approach in heavy-ion physics is the
estimation of the combinatorial background using event mixing. Part of the
information (for example the positive tracks) is taken from one
event, another part (for example the negative tracks) is taken from
a different, but 
`similar' event. The event `similarity' is very important, because
only in this case do the combinations produced from different events
represent the combinatorial background. Typically `similar' in
the example above means with the same multiplicity of negative
tracks. One may require in addition similar impact parameters of the
interactions, rotation of the tracks of the second event to adjust the
event plane, etc. The possibility for event mixing is provided in
\aliroot by the fact that the ESD is stored in trees and one can chain
and access simultaneously many ESD objects. Then the first pass would
be to order the events according to the desired criterion of
`similarity' and to use the obtained index for accessing the `similar'
events in the embedded analysis loops. An example of event mixing is
shown in Fig.~\ref{CH6Fig:phipp}. The background distribution has been
obtained using `mixed events'. The signal distribution has been taken
directly from the \MC simulation. The `experimental distribution' has
been produced by the analysis macro and decomposed as a
superposition of the signal and background histograms.

\begin{figure}[htb]
\centering
\includegraphics*[width=120mm]{chap6fig/phipp.eps}
\caption{Mass spectrum of the ${\rm \phi}$ meson candidates produced
  inclusively in the proton--proton interactions.}
\label{CH6Fig:phipp}
\end{figure}
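
The ordering-and-pairing procedure described above for event mixing can
be sketched in stand-alone C++. The types {\tt Event} and
{\tt MixedPair} and the function {\tt MixEvents} are hypothetical and
much simpler than the ESD classes: each track is reduced to a single
number, `similar' means only a close negative-track multiplicity, and
each event is mixed with just one neighbour.

\begin{verbatim}
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal event; a track is reduced to one number.
struct Event {
  int id;
  std::vector<double> positives;
  std::vector<double> negatives;
};

// Positive track from one event, negative track from another.
struct MixedPair {
  int posEventId;
  int negEventId;
  double pos;
  double neg;
};

// First pass: order the events by negative-track multiplicity.
// Second pass: combine the positives of each event with the negatives
// of its neighbour in that ordering, provided the two multiplicities
// differ by at most maxDiff.
std::vector<MixedPair> MixEvents(std::vector<Event> events,
                                 std::size_t maxDiff) {
  std::stable_sort(events.begin(), events.end(),
                   [](const Event& a, const Event& b) {
                     return a.negatives.size() < b.negatives.size();
                   });
  std::vector<MixedPair> pairs;
  for (std::size_t i = 0; i + 1 < events.size(); ++i) {
    const Event& a = events[i];
    const Event& b = events[i + 1];
    // Safe for unsigned sizes: b has at least as many negatives as a.
    if (b.negatives.size() - a.negatives.size() > maxDiff) continue;
    for (double p : a.positives)
      for (double n : b.negatives)
        pairs.push_back({a.id, b.id, p, n});
  }
  return pairs;
}
\end{verbatim}

In a real analysis the pairs would be used to fill the background
invariant-mass histogram, and further similarity criteria (impact
parameter, event-plane orientation) would enter the ordering.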


\subsection{Analysis of the High-Level Trigger (HLT) data}

This is a specific analysis which is needed in order to adjust the cuts
in the HLT code, or to estimate the HLT
efficiency and resolution. \aliroot provides a transparent way of doing
such an analysis, since the HLT information is stored in the form of ESD
objects in a parallel tree. This also helps in the monitoring and
visualization of the results of the HLT algorithms.


\vspace{-0.1cm}
\subsection{Visualization}

The visualization classes give the possibility for prompt inspection of the
simulation and reconstruction results. The initial version of the
visualization is available in the {\tt AliDisplay} class. Another more
elaborate module {\tt DISPLAY} is under development.

\vspace{-0.2cm}
\section{Existing analysis examples in \aliroot}

There are several dedicated analysis tools available in \aliroot. Their results
were used in the Physics Performance Report and described in
 ALICE internal notes. There are two main classes of analysis: the
first one based directly on ESD, and the second one extracting first
AOD, and then analysing it.

\begin{itemize}
\item{\bf ESD analysis }

\begin{itemize}
\item[ ] {\bf ${\rm V^0}$ and cascade reconstruction/analysis}

  The ${\rm V^0}$ candidates
  are reconstructed during the combined barrel tracking and stored in 
  the ESD object.  The following criteria are used for the selection:
  minimal-allowed impact parameter (in the transverse plane) for each
  track; maximal-allowed DCA between the two tracks;  minimal-allowed
  cosine of the 
  ${\rm V^0}$ pointing angle 
  (angle between the momentum vector of the particle combination
   and the line connecting the production and decay vertices);  minimal
  and maximal radius of the fiducial volume; maximal-allowed ${\rm
  \chi^2}$. The 
  last criterion requires the covariance matrix of track parameters,
  which is available only in {\tt AliESDtrack}. The reconstruction
  is performed by {\tt AliV0vertexer}. This class can be used also
  in the analysis. An example of reconstructed kaons taken directly
  from the ESDs is shown in Fig.~\ref{CH6Fig:kaon}. 

\begin{figure}[th]
\centering
\includegraphics*[width=120mm]{chap6fig/kaon.eps}
\caption{Mass spectrum of the ${\rm K_S^0}$ meson candidates produced
  inclusively in the \mbox{Pb--Pb} collisions.}
\label{CH6Fig:kaon}
\end{figure}

  The cascade hyperons are reconstructed using the ${\rm V^0}$ candidate and
  a `bachelor' track selected according to the cuts above. In addition,
  one requires that the reconstructed ${\rm V^0}$ effective mass lies in
  a certain interval centred on the true value.  The reconstruction
  is performed by {\tt AliCascadeVertexer}, and this class can be
  used in the analysis.

\item[ ] {\bf Open charm}

  This is the second elaborate example of ESD
  analysis. There are two classes, {\tt AliD0toKpi} and
  {\tt AliD0toKpiAnalysis}, which contain the corresponding analysis
  code. The decay under investigation is ${\rm D^0 \to K^- \pi^+}$ and its
  charge conjugate. Each ${\rm D^0}$ candidate is formed by a positive and
  a negative track, selected to fulfil the following requirements:
  minimal-allowed track transverse momentum, minimal-allowed track
  impact parameter in the transverse plane with respect to the primary
  vertex. The selection criteria for each combination include
  maximal-allowed distance of closest approach between the two tracks,
  decay angle of the kaon in the ${\rm D^0}$ rest frame in a given region,
  product of the impact parameters of the two tracks larger than a given value,
  pointing angle between the ${\rm D^0}$ momentum and flight-line smaller than
  a given value. The particle
  identification probabilities are used to reject the wrong
  combinations, namely  ${\rm (K,K)}$ and ${\rm (\pi,\pi)}$, and to enhance the
  signal-to-background ratio at low momentum by requiring the kaon
  identification. All proton-tagged tracks are excluded before the
  analysis loop on track pairs.  More details can be found in
  Ref.~\cite{CH6Ref:Dainese}.

\item[ ] {\bf Quarkonia analysis}

Muon tracks stored in the ESD can be analysed by the macro
{\tt MUONmassPlot\_ESD.C}.
This macro performs an invariant-mass analysis of muon unlike-sign pairs
and calculates the combinatorial background.
Quarkonia \pt and rapidity distributions are built for ${\rm
  J/\psi}$ and ${\rm \Upsilon}$.
This macro also performs a fast single-muon analysis: \pt,
rapidity, and 
${\rm \theta}$ vs ${\rm \varphi}$ acceptance distributions for positive
and negative muon 
tracks with a maximal-allowed ${\rm \chi^2}$.

\end{itemize}

%\newpage
\item{\bf AOD analysis}

Often only a small subset of information contained in the ESD
is needed to perform an analysis. This information
can be extracted and stored in the AOD format in order to reduce
the computing resources needed for the analysis.

The AOD analysis framework implements a set of tools like data readers,
converters, cuts, and other utility classes.
The design is based on two main requirements: flexibility and common
AOD particle interface. This guarantees that several analyses can be
done in sequence within the same computing session.

In order to fulfil the first requirement, the analysis is driven by the
`analysis manager' class and particular analyses are added to it.
It performs the loop over events, which are delivered by a
user-specified reader. This design allows the analyses to be ordered
appropriately if some of them depend on the results of the others.

The cuts are designed to provide high flexibility
and performance. A two-level architecture has been adopted
for all the cuts (particle, pair and event). A class representing a cut
has a list of `base cuts'. Each base cut implements a cut on a
single property or performs a logical operation (and, or) on the result of
other base cuts.

A class representing a pair of particles buffers all the results,
so they can be re-used if required.

\vspace{-0.2cm}
\begin{itemize}
\item[ ] {\bf Particle momentum correlations (HBT) -- HBTAN module}

Particle momentum correlation analysis is based on the event-mixing technique.
It allows one to extract the signal by dividing the appropriate
particle spectra coming from the original events by those from the
mixed events.

Two analysis objects are currently implemented to perform the mixing:
the standard one and the one implementing the Stavinsky
algorithm\cite{CH6Ref:Stavinsky}. Others can easily be added if needed.

An extensive hierarchy of the function base classes has been implemented
facilitating the creation of new functions.
A wide set of the correlation, distribution and monitoring
functions is already available in the module. See Ref.~\cite{CH6Ref:HBTAN}
for the details. 

The package contains two implementations of weighting algorithms, used
for correlation simulations (the first developed by
Lednicky~\cite{CH6Ref:Weights} and the second due to
CRAB~\cite{CH6Ref:CRAB}), both
based on a uniform interface.

\item[ ] {\bf Jet analysis}

The jet analysis\cite{CH6Ref:Loizides} is available in the module JETAN. It has a set of
readers of the form {\tt AliJetParticlesReader<XXX>}, where {\tt XXX}
= {\tt ESD},
{\tt HLT}, {\tt KineGoodTPC}, {\tt Kine}, derived from the base class
{\tt AliJetParticlesReader}. These
provide a uniform interface to
the information from the 
kinematics tree, from HLT, and from the ESD. The first step in the
analysis is the creation of an AOD object: a tree containing objects of
type {\tt AliJetEventParticles}. The particles are selected using a
cut on the minimal-allowed transverse momentum. The second analysis
step consists of jet finding. Several algorithms are available in the
classes of the type {\tt Ali<XXX>JetFinder}.
An example of AOD creation is provided in
the {\tt createEvents.C} macro. The usage of jet finders is illustrated in
{\tt findJets.C} macro.


\item[ ] {\bf ${\rm V^0}$ AODs}

The AODs for ${\rm V^0}$ analysis contain several additional parameters,
calculated and stored for fast access. The methods of the class {\tt
  AliAODv0} provide access to all the geometrical and kinematic
parameters of a ${\rm V^0}$ candidate, and to the ESD information used
for the calculations.

\vspace{-0.1cm}
\item[ ] {\bf MUON}

There is also a prototype MUON analysis provided in
{\tt AliMuonAnalysis}. It simply fills several histograms, namely
the transverse momentum and rapidity for positive and negative muons,
the invariant mass of the muon pair, etc.
\end{itemize}

\end{itemize}
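
The two-level cut architecture of the AOD analysis framework described
above can be sketched in stand-alone C++. All class names below
({\tt Particle}, {\tt BaseCut}, {\tt PtRangeCut}, {\tt EtaRangeCut},
{\tt OrCut}, {\tt ParticleCut}) are hypothetical and are not the classes
of the framework. The sketch also assumes that a cut accepts a particle
only if every base cut in its list accepts it, which the text above
does not specify.

\begin{verbatim}
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Particle { double pt; double eta; };  // hypothetical

// Base cut: one test, true when the particle is accepted.
struct BaseCut {
  virtual ~BaseCut() = default;
  virtual bool Pass(const Particle& p) const = 0;
};

struct PtRangeCut : BaseCut {
  double min, max;
  PtRangeCut(double lo, double hi) : min(lo), max(hi) {}
  bool Pass(const Particle& p) const override {
    return p.pt >= min && p.pt <= max;
  }
};

struct EtaRangeCut : BaseCut {
  double min, max;
  EtaRangeCut(double lo, double hi) : min(lo), max(hi) {}
  bool Pass(const Particle& p) const override {
    return p.eta >= min && p.eta <= max;
  }
};

// Logical "or" of two base cuts; itself a base cut, so it nests.
struct OrCut : BaseCut {
  std::shared_ptr<BaseCut> a, b;
  OrCut(std::shared_ptr<BaseCut> x, std::shared_ptr<BaseCut> y)
      : a(std::move(x)), b(std::move(y)) {}
  bool Pass(const Particle& p) const override {
    return a->Pass(p) || b->Pass(p);
  }
};

// Top-level cut: holds a list of base cuts (assumed: all must pass).
class ParticleCut {
 public:
  void Add(std::shared_ptr<BaseCut> c) { cuts_.push_back(std::move(c)); }
  bool Pass(const Particle& p) const {
    for (const auto& c : cuts_)
      if (!c->Pass(p)) return false;
    return true;
  }
 private:
  std::vector<std::shared_ptr<BaseCut>> cuts_;
};
\end{verbatim}

With this layout a new selection variable only requires one more small
class derived from the base cut, and the same pattern could be repeated
for pair and event cuts.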

The analysis framework is one of the most active fields of
development. Many new classes and macros are in preparation. Some of
them have already been tested on the data produced during the Physics Data
Challenge 2004 and will become part of the ALICE software.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

